CriteriaPO/qwen2.5-3b-dpo-finegrained

Text generation · 3.1B parameters · BF16 weights · 32K context length · Transformer architecture · Published: May 4, 2025 · Hosted on Hugging Face

CriteriaPO/qwen2.5-3b-dpo-finegrained is a 3.1 billion parameter language model, fine-tuned by CriteriaPO using Direct Preference Optimization (DPO) on top of the Qwen2.5-3B-SFT-10 base model. It is designed to generate high-quality, preference-aligned text, and its 32K token context length lets it handle long prompts and multi-turn conversations. Its primary strength lies in producing outputs that align with human preferences, making it suitable for conversational AI and instruction-following tasks.


Model Overview

CriteriaPO/qwen2.5-3b-dpo-finegrained is a 3.1 billion parameter language model developed by CriteriaPO. It is a fine-tuned iteration of the CriteriaPO/qwen2.5-3b-sft-10 base model, specifically optimized using Direct Preference Optimization (DPO). This training methodology, detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aims to align the model's outputs more closely with human preferences.
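The checkpoint can typically be loaded with the standard Hugging Face transformers text-generation API. The sketch below is illustrative only: the model ID comes from this card, but the chat template, dtype, and generation settings are assumptions and may need adjusting for your environment.

```python
# Minimal inference sketch, assuming the checkpoint works with the standard
# transformers causal-LM API (as Qwen2.5-based models generally do).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CriteriaPO/qwen2.5-3b-dpo-finegrained"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)

# Build a chat-style prompt via the tokenizer's chat template (if one is shipped).
messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```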

Key Capabilities

  • Preference-Aligned Text Generation: The DPO fine-tuning process enhances the model's ability to generate responses that are preferred by humans, making it suitable for applications requiring nuanced and contextually appropriate output.
  • Instruction Following: Building upon its SFT base, the DPO fine-tuning further refines its capacity to understand and execute complex instructions.
  • Extended Context Window: With a context length of 32,768 tokens, the model can process and generate text based on extensive input, supporting more complex and longer-form interactions.

Training Details

The model was trained using the TRL (Transformer Reinforcement Learning) library, a framework for applying reinforcement learning techniques to transformer models. The DPO method directly optimizes a policy to maximize the likelihood of preferred responses over dispreferred ones, without the need for a separate reward model. This approach contributes to its ability to produce high-quality, human-aligned text.
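For orientation, the snippet below is a minimal sketch of what such a DPO run looks like with TRL's `DPOTrainer`, starting from the SFT checkpoint named on this card. The dataset name, hyperparameters, and TRL version details are assumptions for illustration, not the recipe CriteriaPO actually used.

```python
# Illustrative DPO training sketch with TRL. Placeholder dataset and hyperparameters;
# not the configuration used to produce this checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "CriteriaPO/qwen2.5-3b-sft-10"  # SFT base model named on this card
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# DPO expects preference pairs: each row holds a prompt, a chosen response,
# and a rejected response. "your/preference-dataset" is a hypothetical placeholder.
train_dataset = load_dataset("your/preference-dataset", split="train")

training_args = DPOConfig(
    output_dir="qwen2.5-3b-dpo",
    beta=0.1,                       # strength of the penalty keeping the policy near the reference
    per_device_train_batch_size=2,
    learning_rate=5e-7,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # recent TRL releases; older ones take `tokenizer=` instead
)
trainer.train()
```

Because DPO optimizes the policy directly on preference pairs, no separate reward model is trained or loaded here; the frozen reference copy of the policy that TRL maintains internally plays that role.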

Good For

  • Applications requiring models that generate responses aligned with human preferences.
  • Conversational AI and chatbots where response quality and naturalness are critical.
  • Tasks benefiting from a model with a substantial context window for understanding long prompts or conversations.