Model Overview
allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm is a 13-billion-parameter language model from the Tulu V2.5 series, developed by AllenAI. It is fine-tuned from the meta-llama/Llama-2-13b-hf base model and aligned using Proximal Policy Optimization (PPO). Its distinguishing feature is the training setup: it is trained on the UltraFeedback dataset using that dataset's per-aspect (fine-grained) scores, and a 70B-parameter UltraFeedback reward model (RM) supplies the reward signal during PPO.
Key Capabilities & Performance
- Generalist Assistant: Designed to act as a helpful assistant across a wide range of tasks.
- PPO Alignment: Leverages PPO with a 70B RM for enhanced alignment, as detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback" (arXiv:2406.09279).
- Strong Performance: This 13B model matches or surpasses Tulu 2+DPO 13B and, in some cases, even Tulu 2+DPO 70B, for example on AlpacaEval 2 win rate (26.7% vs. 21.2%).
- Input Format: For optimal generation quality, inputs should follow this chat format, with each marker on its own line:
  <|user|>
  Your message here!
  <|assistant|>
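The chat format above can be assembled programmatically rather than typed by hand. The following is a minimal sketch, not an official utility: the helper name `format_tulu_prompt` is hypothetical, and the code assumes that the markers and the message are separated by newlines and that the prompt ends with a newline after `<|assistant|>`, as in the Tulu family's published template.

```python
def format_tulu_prompt(user_message: str) -> str:
    """Wrap a single user message in the <|user|> / <|assistant|> markers.

    Hypothetical helper for illustration only.
    """
    # Assumption: markers and message are newline-separated, and the prompt
    # ends with a newline after <|assistant|> so generation starts on a
    # fresh line.
    return f"<|user|>\n{user_message.strip()}\n<|assistant|>\n"


if __name__ == "__main__":
    # Print the formatted prompt for the placeholder message from the card.
    print(format_tulu_prompt("Your message here!"))
```

The returned string is what would be tokenized and passed to the model. Building it in one place keeps the trailing newline from being dropped by accident.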
Intended Uses & Limitations
This model is suitable for general assistant-like applications. However, unlike models such as ChatGPT, it has not been explicitly aligned for safety, so it may produce problematic outputs, especially when prompted to do so. Users should also be aware of potential biases inherited from the Llama 2 base model and from the training data.