allenai/tulu-v2.5-ppo-13b-nectar-60k
The allenai/tulu-v2.5-ppo-13b-nectar-60k model is a 13 billion parameter language model developed by AllenAI, fine-tuned from Llama-2-13b-hf. It is part of the Tulu V2.5 series, trained with PPO on a 60k subsample of the Nectar dataset and designed to function as a helpful assistant. The model focuses on learning from preference feedback, leveraging a reward model trained on the same Nectar split to enhance its conversational capabilities.
Overview
allenai/tulu-v2.5-ppo-13b-nectar-60k is a 13 billion parameter language model developed by AllenAI, building upon the meta-llama/Llama-2-13b-hf base model. It is a member of the Tulu V2.5 suite, which emphasizes training with DPO and PPO from preference feedback. This specific model was fine-tuned using PPO on a 60,000-sample subset of the Nectar dataset, utilizing a dedicated 13B reward model for alignment.
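Since the model is a standard Llama-2-based causal language model, it should load through the Hugging Face transformers library in the usual way. The following is a minimal loading sketch; the dtype and device settings are illustrative assumptions, not published recommendations:

```python
# Minimal loading sketch using Hugging Face transformers.
# Assumes transformers and accelerate are installed, plus hardware with
# enough memory for 13B weights in float16 (roughly 26 GB).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/tulu-v2.5-ppo-13b-nectar-60k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the 13B weights
    device_map="auto",          # let accelerate place layers on available devices
)
```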
Key Capabilities
- Helpful Assistant: Designed and trained to act as a conversational assistant.
- Preference Learning: Leverages Proximal Policy Optimization (PPO) with a reward model for improved alignment based on preference feedback.
- Instruction Following: Initially fine-tuned on a diverse mix of human-created instructions and synthetic dialogues from the Tulu V2 dataset.
- Specific Input Format: Optimized for a chat template using `<|user|>` and `<|assistant|>` tags, requiring a newline after `<|assistant|>` for optimal generation quality (see the sketch after this list).
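Building on the loading sketch above, here is a minimal example of constructing a prompt in the expected format and generating a response. The question text and sampling parameters are illustrative assumptions, not tuned defaults:

```python
# Build a prompt in the Tulu chat format: <|user|> and <|assistant|> tags,
# with a newline after <|assistant|> (omitting it can degrade output quality).
prompt = "<|user|>\nExplain PPO in one sentence.\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,      # illustrative sampling settings
    temperature=0.7,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```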
Good For
- Applications requiring a helpful, instruction-following AI assistant.
- Research into PPO-based alignment methods and learning from preference feedback, as detailed in the associated paper: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.
Limitations
- The model has not been explicitly aligned for safety during the RLHF phase and lacks in-the-loop filtering of responses, so it can produce problematic outputs when prompted to do so.