Model Overview
allenai/llama-3-tulu-v2.5-8b-uf-mean-8b-uf-rm is an 8-billion-parameter language model from AllenAI's Tulu V2.5 series, built on Meta's Llama 3 architecture. It is trained to act as a helpful assistant using Proximal Policy Optimization (PPO). Training used the UltraFeedback dataset, with preferences derived from its fine-grained aspect scores, and the PPO reward signal came from an 8B reward model that was also trained on UltraFeedback.
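As a rough illustration of how fine-grained aspect scores can be turned into preference data, the sketch below averages the per-aspect scores of each completion and pairs the best-scoring completion against the worst. This is a minimal sketch under stated assumptions, not the authors' pipeline: the aspect names, the example scores, the helper names, and the "highest mean vs. lowest mean" pairing rule are all hypothetical choices made for illustration.

```python
from statistics import mean

# Assumed aspect names; the text above only says "fine-grained aspect scores".
ASPECTS = ("instruction_following", "honesty", "truthfulness", "helpfulness")


def mean_aspect_score(scores: dict) -> float:
    """Average the per-aspect scores of one completion."""
    return mean(scores[aspect] for aspect in ASPECTS)


def to_preference_pair(prompt: str, completions: list) -> dict:
    """Build one (chosen, rejected) pair from scored completions.

    Assumption: the completion with the highest mean score is "chosen"
    and the one with the lowest mean score is "rejected".
    """
    ranked = sorted(
        completions,
        key=lambda c: mean_aspect_score(c["scores"]),
        reverse=True,
    )
    return {
        "prompt": prompt,
        "chosen": ranked[0]["text"],
        "rejected": ranked[-1]["text"],
    }


# Hypothetical example data, not taken from UltraFeedback.
example_completions = [
    {"text": "A", "scores": {"instruction_following": 5, "honesty": 4,
                             "truthfulness": 5, "helpfulness": 4}},
    {"text": "B", "scores": {"instruction_following": 2, "honesty": 3,
                             "truthfulness": 2, "helpfulness": 3}},
]
pair = to_preference_pair("Explain PPO briefly.", example_completions)
```

Pairs of this shape are the kind of input a reward model is typically trained on; the reward model then scores policy outputs during PPO.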
Key Capabilities and Training
- Architecture: Built on Meta Llama 3, part of the Tulu V2.5 suite which updates the original Tulu 2 series.
- Alignment: Aligned using PPO, a reinforcement learning technique, with a dedicated 8B reward model.
- Dataset: Trained on the ultrafeedback_mean_aspects split of the UltraFeedback dataset, focusing on preference feedback.
- Performance: Achieves 61.5% accuracy on GSM8k 8-shot CoT, indicating proficiency in mathematical reasoning tasks.
- Input Format: Designed to work with a specific chat template (note the required newline after <|assistant|>):
  <|user|>
  Your message here!
  <|assistant|>
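The chat template above can be assembled with a small helper. This is a minimal single-turn sketch; the function name build_prompt is hypothetical, and it assumes each tag sits on its own line, with the prompt ending in the newline that follows <|assistant|>.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user message in the <|user|> / <|assistant|> template.

    The trailing "\n" after <|assistant|> is the newline the model card
    calls out as required.
    """
    return f"<|user|>\n{user_message.strip()}\n<|assistant|>\n"


prompt = build_prompt("What is 12 * 7?")
print(repr(prompt))
```

The resulting string would be passed to the tokenizer as-is. If the tokenizer ships its own chat template, using that is likely the safer choice than hand-building the string.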
Use Cases and Considerations
This model is suitable for applications that need a helpful, instruction-following assistant, particularly where mathematical reasoning matters. As part of the Tulu V2.5 suite, it offers a Llama 3-based alternative to previous Tulu models. Developers should be aware that, like other Tulu models, it has not undergone extensive safety alignment beyond the RLHF phase, so it may produce problematic outputs, especially when specifically prompted to do so. For more detail on training and evaluation, refer to the associated paper: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.