Tulu V2.5 DPO 13B - UltraFeedback Overall
This model is a 13-billion-parameter language model developed by AllenAI as part of the Tulu 2 suite. It is fine-tuned from meta-llama/Llama-2-13b-hf and aligned using Direct Preference Optimization (DPO) on the ultrafeedback_overall split of the UltraFeedback dataset. The training methodology focuses on learning from preference feedback, as detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).
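For intuition, here is a minimal sketch of the DPO objective in PyTorch. The function and argument names and the beta value are illustrative, not taken from the actual training code; see the paper for the exact configuration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each tensor holds the summed log-probability of a full response
    (chosen or rejected) under the policy or the frozen reference model.
    beta scales the implicit reward, i.e. the strength of the KL
    penalty keeping the policy close to the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to rank the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```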
Key Capabilities
- Helpful Assistant: Trained to act as a conversational assistant, responding to diverse instructions.
- Preference-Aligned: Utilizes DPO with UltraFeedback data for improved response quality based on human preferences.
- English-Language Support: Primarily designed for English tasks.
Intended Uses & Limitations
The model is suitable for general-purpose conversational AI applications. It was initially fine-tuned on a mix of human-created and synthetic dialogues. Like many LLMs, it has not undergone extensive safety alignment during the RLHF phase and may produce problematic outputs if prompted to do so. The model expects inputs formatted with <|user|> and <|assistant|> tags, including a newline after <|assistant|>, for optimal generation quality; see the sketch below.
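The snippet below is a minimal sketch of loading the model with Hugging Face transformers and applying this chat format. The repository ID is an assumption based on the suite's naming convention; verify it on the Hub before use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repository ID -- confirm against the actual model page.
model_id = "allenai/tulu-v2.5-dpo-13b-uf-overall"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Tulu chat format: note the trailing newline after <|assistant|>.
prompt = "<|user|>\nWhat is DPO in one sentence?\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

Omitting the newline after <|assistant|> can noticeably degrade output quality, so format prompts exactly as shown.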