allenai/tulu-v2.5-dpo-13b-shp2
allenai/tulu-v2.5-dpo-13b-shp2 is a 13 billion parameter language model developed by AllenAI, fine-tuned from Meta's Llama-2-13b-hf. This model is part of the Tulu V2.5 series, specifically aligned using DPO (Direct Preference Optimization) on the SHP-2 dataset. It is designed to function as a helpful assistant, building upon the Tulu 2 suite of RLHF-tuned chat models.
Model Overview
This checkpoint belongs to the Tulu V2.5 series, a collection of models fine-tuned with Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) on top of the Tulu 2 suite of chat models. This particular variant applies DPO to the SHP-2 preference data, starting from the meta-llama/Llama-2-13b-hf base model.
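For context, DPO optimizes the policy directly against pairwise preference data instead of training a separate reward model and running RL. Below is a minimal PyTorch sketch of the DPO objective; the function name, argument layout, and beta value are illustrative and not taken from the Tulu training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023).

    Each argument is a batch of summed log-probabilities of the chosen or
    rejected response under the trainable policy or the frozen reference
    model. beta=0.1 is a common default, not necessarily the value used
    for this checkpoint.
    """
    # Implicit rewards: how much the policy has moved away from the
    # reference model on each response.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```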
Key Capabilities & Training
- Assistant-Oriented: Designed to act as a helpful assistant, leveraging RLHF tuning.
- DPO Alignment: Trained with Direct Preference Optimization on the shp_2 split of the allenai/tulu-2.5-preference-data dataset.
- Input Format: Requires the Tulu chat template for best results: `<|user|>`, your message on the next line, then `<|assistant|>` followed by a newline (see the usage sketch after this list).
- Research Focus: Developed as part of research into disentangling best practices for learning from preference feedback, detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback".
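A minimal usage sketch with the Hugging Face transformers library, assuming a recent transformers release and accelerate installed for `device_map="auto"`; the generation settings are illustrative defaults, not values recommended by the model authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/tulu-v2.5-dpo-13b-shp2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # requires accelerate; shards across available GPUs
    torch_dtype="auto",
)

# Tulu-style chat template. The trailing newline after <|assistant|>
# can noticeably affect generation quality.
prompt = "<|user|>\nWhat is Direct Preference Optimization?\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256,
                         do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```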
Limitations
- Safety Alignment: The model has not undergone dedicated safety alignment during the RLHF phase and is not deployed with in-the-loop response filtering, so it can produce problematic outputs, especially when prompted to do so.