Tulu V2.5 PPO 13B - UltraFeedback Mean w. 13B mixture RM
This is a 13-billion-parameter language model from the Tulu V2.5 series, developed by AllenAI and fine-tuned from meta-llama/Llama-2-13b-hf. It was trained with Proximal Policy Optimization (PPO), using a 13B reward model trained on the Tulu v2.5 preference mixture and UltraFeedback prompts for the policy training. The training methodology is detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).
Key Capabilities & Features
- Assistant-like Behavior: Trained to act as a helpful assistant, suitable for conversational AI and instruction-following tasks.
- RLHF Tuned: Trained with PPO, a reinforcement learning from human feedback (RLHF) method, building upon the Tulu 2 suite.
- English Language Support: Primarily focused on English language generation.
- Apache 2.0 License: Available for broad use under the Apache 2.0 license.
- Standardized Input Format: Designed to work optimally with a specific chat template:
<|user|>
Your message here!
<|assistant|>
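The chat template above can be assembled programmatically. The following is a minimal sketch, not the authors' implementation: `build_prompt` is a hypothetical helper, and it assumes that each tag sits on its own line and that the prompt ends with a newline after the final `<|assistant|>` tag. The multi-turn handling is likewise an assumption, since the template shown here covers only a single user turn.

```python
from typing import Dict, List

USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Render a list of {"role", "content"} messages into a Tulu-style prompt.

    Assumption: tags and message text are separated by newlines, and the
    prompt ends with "<|assistant|>\n" so the model continues from there.
    """
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")

    parts = []
    for message in messages:
        role = message["role"]
        if role == "user":
            tag = USER_TAG
        elif role == "assistant":
            tag = ASSISTANT_TAG
        else:
            raise ValueError(f"unsupported role: {role!r}")
        # One tag line, then the message text, then a newline.
        parts.append(f"{tag}\n{message['content'].strip()}\n")

    # Open the assistant turn that the model is expected to complete.
    parts.append(f"{ASSISTANT_TAG}\n")
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt([{"role": "user", "content": "Your message here!"}]))
```

The resulting string would be passed to the tokenizer as ordinary text. If the tokenizer shipped with the model defines its own chat template, prefer that over a hand-rolled helper like this one.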
Intended Uses & Limitations
This model is suitable for applications that need an instruction-tuned conversational agent. It was initially fine-tuned on a diverse mix of human-created instructions and synthetic dialogues. However, the Tulu models were not aligned to generate safe completions during the RLHF phase, nor are they deployed with in-the-loop filtering of responses, so this model can produce problematic outputs if prompted to do so. Users should implement their own safety measures.