Tulu V2.5 PPO 13B - HH-RLHF 60k Overview
This model is a 13 billion parameter language model from the Tulu V2.5 series, developed by AllenAI. It is fine-tuned from meta-llama/Llama-2-13b-hf and specifically aligned using Proximal Policy Optimization (PPO) on a 60,000-sample subset of the HH-RLHF dataset. The training process involved a 13B reward model trained on the same HH-RLHF split, with the same prompts reused during PPO training. This model is designed to act as a helpful assistant, building upon the Tulu 2 suite of models.
Key Characteristics
- Architecture: Fine-tuned from Llama 2 13B, part of a suite of RLHF-tuned chat models.
- Alignment Method: Utilizes PPO (Proximal Policy Optimization) for alignment, as detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback."
- Training Data: Initially fine-tuned on a filtered version of the Tulu V2 mix dataset, then further aligned on the hh_rlhf_60k split of the allenai/tulu-2.5-preference-data dataset.
- Input Format: Requires a specific chat template for optimal generation quality: `<|user|>\nYour message here!\n<|assistant|>\n` (note the newlines separating the role tags from the message).
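Assuming the template above, a prompt can be assembled as a plain string before tokenization. The helper below is an illustrative sketch, not the official chat-template implementation shipped with the model:

```python
def format_tulu_prompt(messages):
    """Render a list of {"role": ..., "content": ...} dicts into the
    Tulu chat format described above.

    Minimal sketch: each turn becomes "<|role|>\nmessage\n", and the
    string ends with "<|assistant|>\n" so the model continues as the
    assistant.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}\n")
    # Trailing assistant tag cues the model to generate its reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_tulu_prompt([{"role": "user", "content": "Your message here!"}])
print(prompt)
# → <|user|>
#   Your message here!
#   <|assistant|>
```

The resulting string can then be passed to the model's tokenizer and generation call as usual.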
Intended Use and Considerations
This model is intended for use as a helpful assistant. Developers should be aware that, unlike some other RLHF-tuned models, Tulu models were not aligned with in-the-loop safety filtering during the RLHF phase. Consequently, the model may produce problematic outputs, especially when explicitly prompted to do so. The model is released under the Apache 2.0 license.