Model Overview
CriteriaPO/qwen2.5-3b-dpo-finegrained is a 3.1-billion-parameter language model developed by CriteriaPO. It is fine-tuned from the CriteriaPO/qwen2.5-3b-sft-10 base model using Direct Preference Optimization (DPO). This training method, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aligns the model's outputs more closely with human preferences.
Key Capabilities
- Preference-Aligned Text Generation: The DPO fine-tuning process enhances the model's ability to generate responses that are preferred by humans, making it suitable for applications requiring nuanced and contextually appropriate output.
- Instruction Following: Building upon its SFT base, the DPO fine-tuning further refines its capacity to understand and execute complex instructions.
- Extended Context Window: With a context length of 32,768 tokens, the model can process and generate text based on extensive input, supporting more complex and longer-form interactions.
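For conversational use, prompts for this model would typically be assembled with the tokenizer's chat template. The sketch below builds such a prompt by hand purely for illustration; it assumes the model inherits Qwen2.5's ChatML-style template (`<|im_start|>`/`<|im_end|>` markers), which is an assumption, not something stated in this card. In practice, prefer `tokenizer.apply_chat_template`.

```python
def build_chatml_prompt(messages):
    """Format a list of {"role", "content"} dicts as a ChatML-style prompt.

    Assumes a Qwen2.5-style template; real usage should rely on
    tokenizer.apply_chat_template rather than hand-built strings.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt for the reply
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize DPO in one sentence."},
])
print(prompt)
```

The trailing `<|im_start|>assistant\n` plays the role of `add_generation_prompt=True` in `apply_chat_template`: it cues the model to generate the assistant turn.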
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) library, a framework for applying reinforcement-learning-style alignment techniques to transformer models. DPO directly optimizes the policy to increase the likelihood of preferred responses relative to dispreferred ones, regularized against a frozen reference model; unlike PPO-based RLHF, it requires no separately trained reward model. This approach contributes to the model's ability to produce high-quality, human-aligned text.
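The per-pair DPO objective described above can be sketched in a few lines. This is a minimal illustration of the loss from the DPO paper, not TRL's implementation; the variable names and the example log-probabilities are invented for demonstration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Each argument is the total log-probability of a full response
    (sum of token log-probs) under the policy or the frozen reference.
    beta controls how strongly the policy is kept near the reference.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# When the policy matches the reference, both ratios are zero and the
# loss is -log(0.5) = log 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference does, which is exactly the "preferred over dispreferred" pressure described above.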
Good For
- Applications requiring models that generate responses aligned with human preferences.
- Conversational AI and chatbots where response quality and naturalness are critical.
- Tasks benefiting from a model with a substantial context window for understanding long prompts or conversations.