allenai/llama-3-tulu-v2.5-8b-uf-mean-70b-uf-rm
allenai/llama-3-tulu-v2.5-8b-uf-mean-70b-uf-rm is an 8 billion parameter language model from AllenAI, built on the Meta Llama 3 architecture. It belongs to the Tulu V2.5 series and was fine-tuned with Proximal Policy Optimization (PPO) on the UltraFeedback dataset, using a 70B UltraFeedback Reward Model. The model is designed as a helpful assistant for chat-based applications and general conversational tasks.
Model Overview
allenai/llama-3-tulu-v2.5-8b-uf-mean-70b-uf-rm is an 8 billion parameter language model developed by AllenAI as part of the Tulu V2.5 series. It is built on the Meta Llama 3 architecture and fine-tuned with Proximal Policy Optimization (PPO). Training uses the UltraFeedback dataset, with its per-aspect (fine-grained) scores serving as the preference signal, and is guided by a 70 billion parameter UltraFeedback Reward Model.
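To make the per-aspect scoring concrete, here is a minimal sketch of how fine-grained scores could be turned into a preference pair. It assumes the per-aspect scores are averaged (as the "mean" in the model name suggests) and that the highest- and lowest-scoring completions become the chosen and rejected examples. The function names, the example scores, and the rejected-selection rule are all hypothetical and are not taken from the actual training pipeline.

```python
from statistics import mean

# Hypothetical per-aspect scores for two completions of the same prompt.
# The aspect names follow UltraFeedback's four rating axes; the values are made up.
completions = {
    "completion_a": {"instruction_following": 5, "honesty": 4, "truthfulness": 5, "helpfulness": 4},
    "completion_b": {"instruction_following": 3, "honesty": 4, "truthfulness": 3, "helpfulness": 2},
}


def rank_by_mean_score(scored):
    """Return completion ids sorted from highest to lowest mean aspect score."""
    return sorted(scored, key=lambda cid: mean(scored[cid].values()), reverse=True)


def to_preference_pair(scored):
    """Take the best-scoring completion as chosen and the worst as rejected.

    Assumption: this selection rule is illustrative only.
    """
    ranked = rank_by_mean_score(scored)
    return {"chosen": ranked[0], "rejected": ranked[-1]}


pair = to_preference_pair(completions)
print(pair)
```

Pairs of this kind are the usual input for training a reward model, which then scores the policy's outputs during PPO.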
Key Capabilities
- Helpful Assistant: Designed to function as a helpful assistant, making it suitable for conversational AI and instruction-following tasks.
- PPO Fine-tuning: Benefits from PPO training with a large 70B parameter reward model, enhancing its alignment with human preferences.
- Llama 3 Base: Utilizes the Llama 3 base model, providing a strong foundation for general language understanding and generation.
- Chat Format: Optimized for a specific chat input format (<|user|> Your message here! <|assistant|>), with a provided chat template for consistent performance.
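The chat format above can be sketched as a small prompt builder. This is a minimal illustration, not the model's official template: it assumes each role tag and each message sits on its own line, and the helper name build_prompt is hypothetical. It also omits any end-of-sequence token that the real template may place after assistant turns.

```python
def build_prompt(messages):
    """Render {"role", "content"} dicts in the <|user|>/<|assistant|> format.

    Assumption: each role tag and each message is followed by a newline.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role}")
        parts.append(f"<|{role}|>\n{message['content']}\n")
    # End with an open assistant tag so the model generates the next reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "Your message here!"}])
print(prompt)
```

In practice, applying the chat template bundled with the model's tokenizer (for example through `apply_chat_template` in Hugging Face Transformers) is likely the safer route, since it reflects the exact format used in training.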
Performance Highlights
Although it has only 8B parameters, the model achieves an AlpacaEval 2 length-controlled (LC) win rate of 28.8, outperforming some larger 13B Tulu V2.5 models on this metric. For detailed evaluation and training specifics, refer to the associated paper: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.
Intended Uses
This model is intended for applications that need a helpful, instruction-following assistant. Like other Tulu models, it has not been explicitly aligned for safety during the RLHF phase and may produce problematic outputs if prompted to do so.