Tulu V2.5 DPO 13B - UltraFeedback Mean: A Helpful Assistant Model
This model, developed by AllenAI, is a 13-billion-parameter language model fine-tuned from meta-llama/Llama-2-13b-hf. It is part of the Tulu V2.5 series, which aims to build helpful assistant models through learning from preference feedback.
Key Capabilities and Training
- RLHF-Tuned Chat Model: This model belongs to Tulu V2.5, a series of RLHF (Reinforcement Learning from Human Feedback) tuned chat models designed to act as helpful assistants.
- DPO Training: The Tulu V2.5 series builds on the Tulu 2 suite using Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). This variant, as its name indicates, was trained with DPO.
- UltraFeedback Dataset: This specific model variant was trained on the UltraFeedback dataset, utilizing the average of fine-grained scores to determine chosen and rejected responses, as detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).
- Input Format: The model expects a specific chat format, with each tag on its own line and a newline after <|assistant|>:
  <|user|>
  Your message here!
  <|assistant|>
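As a minimal sketch of the input format, the helper below builds a single-turn prompt string. It assumes that each tag sits on its own line and that the prompt ends with a newline after <|assistant|>. The function name build_prompt and the two tag constants are hypothetical names chosen for illustration, not part of any released API.

```python
# Hypothetical helper: formats one user message in the Tulu chat format.
# Assumption: tags are separated from content by newlines, and the prompt
# ends with a newline after the assistant tag.
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(user_message: str) -> str:
    """Return a single-turn prompt ending in an open assistant turn."""
    text = user_message.strip()
    if not text:
        raise ValueError("user_message must not be empty")
    # The trailing newline leaves the assistant turn open for generation.
    return f"{USER_TAG}\n{text}\n{ASSISTANT_TAG}\n"


if __name__ == "__main__":
    print(repr(build_prompt("Your message here!")))
```

The resulting string can be passed to a tokenizer as plain text. Multi-turn conversations are left out of this sketch because the way earlier assistant turns are terminated is not described here.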
Intended Uses and Limitations
- Assistant Applications: Intended for use in applications requiring a helpful AI assistant capable of engaging in diverse dialogues.
- Bias and Risks: The model has not been explicitly aligned for safety during the RLHF phase, meaning it may produce problematic outputs, especially when prompted to do so. Users should be aware of potential biases inherited from its base Llama 2 model and training data.