Overview
allenai/llama-3-tulu-2-dpo-8b is an 8-billion-parameter language model from AllenAI, built on the Meta Llama 3 base model. It is designed to act as a helpful assistant and was fine-tuned in two stages. First, it was fine-tuned on the Tulu V2 dataset, a diverse mix of publicly available, synthetic, and human-created instructions and dialogues. It was then aligned with Direct Preference Optimization (DPO) on the UltraFeedback dataset, which contains 64,000 prompts with model completions ranked by GPT-4.
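To make the DPO stage more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair (one preferred and one rejected completion for the same prompt). It is not AllenAI's training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are assumed to be summed log-probabilities of each completion under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    beta is a hypothetical default; the source does not state the value used.
    """
    # How much more the policy favors each completion than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss falls as the policy prefers the chosen completion more strongly
# than the reference model does.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
print(neutral, improved)
```

In practice the loss is averaged over a batch of pairs and minimized with respect to the policy's parameters only; the reference model stays fixed.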
Key Capabilities
- Assistant-like Interactions: Optimized for generating helpful and coherent responses in conversational settings.
- Preference Alignment: Further trained with DPO on GPT-4-ranked preference data (UltraFeedback), aiming for improved response quality and alignment.
- English Language Focus: Primarily developed and optimized for English natural language processing tasks.
Performance Highlights
Its MMLU and GSM8k scores are competitive. Compared to the base Llama 3 8B model, DPO training raises its AlpacaEval 1 score to 93.02 and its TruthfulQA %Info+True score to 0.698. It also reaches 0.688 on Codex HumanEval Pass@10.
Intended Uses
This model is suited to applications that need a helpful, aligned conversational AI. Like many models that have not been explicitly aligned for safety through extensive RLHF, it may produce problematic outputs if specifically prompted to do so. The model expects inputs formatted with <|user|> and <|assistant|> tags. A newline after <|assistant|> is important for generation quality.
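The input format described above can be sketched as a small helper that wraps a single user message in the expected tags. The function name `build_prompt` is hypothetical, and placing each tag on its own line is an assumption; the text above only states that a newline must follow <|assistant|>.

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user message in <|user|> / <|assistant|> tags."""
    # Assumption: each tag sits on its own line. The trailing "\n" after
    # <|assistant|> is the newline the model card calls out as important.
    return f"<|user|>\n{user_message.strip()}\n<|assistant|>\n"


prompt = build_prompt("What is the capital of France?")
print(repr(prompt))
```

The resulting string would be passed to the tokenizer as-is. Stripping whitespace from the prompt afterwards would drop the final newline, so it is worth checking that no later preprocessing step does that.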