Tulu V2.5 DPO 13B - AlpacaFarm Human Preferences
This model is a 13-billion-parameter language model from AllenAI, fine-tuned from meta-llama/Llama-2-13b-hf. It belongs to the Tulu V2.5 suite, a collection of models aligned with preference-learning methods (DPO and PPO) across a range of preference datasets to build helpful assistant models.
Key Capabilities & Training
- Preference Alignment: The model is trained with Direct Preference Optimization (DPO) on the `alpaca_farm_human_pref` dataset to align its outputs with human preferences (see the loss sketch after this list).
- Base Model: It builds on the Tulu 2 suite, which was first fine-tuned on a filtered mix of publicly available, synthetic, and human-created datasets.
- Input Format: Designed to work with the chat template `<|user|>`, your message, then `<|assistant|>`, each on its own line with a trailing newline after `<|assistant|>`, for optimal generation quality; see the generation example below.
- Research Focus: This model is a product of research detailed in the paper "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback," which explores effective methods for learning from preference feedback.
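For readers unfamiliar with DPO, here is a minimal sketch of the objective in PyTorch. The function name, argument layout, and the `beta` default are illustrative, not taken from the training code; the paper tunes such hyperparameters per dataset.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (sketch, not the released code).

    Each argument is the summed log-probability of the chosen or rejected
    response under the trained policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```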
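And a minimal generation sketch using Hugging Face `transformers`, showing the chat template in practice. The repository id below is an assumption; verify the exact name on the Hub before use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id -- verify the exact name on the Hugging Face Hub.
MODEL_ID = "allenai/tulu-v2.5-dpo-13b-alpacafarm-human-pref"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Tulu chat template: the trailing newline after <|assistant|>
# can noticeably affect generation quality.
prompt = "<|user|>\nWrite a haiku about alignment research.\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```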
Intended Uses & Limitations
This model is intended for use as a helpful assistant, particularly in scenarios where alignment with human preferences matters. Note, however, that the Tulu models have not undergone the in-the-loop safety filtering applied to some commercial models, so they can produce problematic outputs when prompted to do so. Users should be aware of these limitations and put appropriate safeguards in place before deployment.