CriteriaPO/qwen2.5-3b-dpo-coarse

3.1B parameters · BF16 · 32,768-token context · May 4, 2025

Model Overview

CriteriaPO/qwen2.5-3b-dpo-coarse is a 3.1-billion-parameter language model developed by CriteriaPO. It is a fine-tuned variant of CriteriaPO/qwen2.5-3b-sft-10, trained with Direct Preference Optimization (DPO). This training method, introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aligns the model's outputs more closely with human preferences.
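For reference, the DPO objective from that paper trains the policy πθ against a frozen reference model πref (typically the SFT checkpoint the model starts from) on preference pairs consisting of a prompt x, a preferred response y_w, and a dispreferred response y_l:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

Here σ is the logistic function and β controls how far the policy may drift from the reference model; the β value used for this checkpoint is not stated in the card.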

Key Capabilities

  • Preference Alignment: DPO training tunes the model's responses toward the human preferences expressed in its preference data.
  • Text Generation: Capable of generating coherent and contextually relevant text based on prompts.
  • Qwen2.5 Architecture: Leverages the foundational capabilities of the Qwen2.5 model family.
  • Context Length: Supports a context window of 32,768 tokens, allowing it to process long inputs and produce extended outputs.

Training Details

The model was trained with the TRL (Transformer Reinforcement Learning) framework, version 0.12.2. Rather than fitting a separate reward model and running a reinforcement-learning loop, DPO optimizes the policy directly on preference pairs with a classification-style loss, treating the language model itself as an implicit reward model. Learning directly from preference data in this way simplifies the pipeline and is effective at improving the quality and safety of generated text.
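The exact training script, dataset, and hyperparameters for this checkpoint are not included in the card. As an illustration only, a DPO run with TRL 0.12 over a preference dataset with prompt/chosen/rejected columns looks roughly like the sketch below; the dataset name and all hyperparameter values are placeholders, not the settings used for this model.

```python
# Illustrative sketch only: the actual dataset and hyperparameters used for
# CriteriaPO/qwen2.5-3b-dpo-coarse are not documented in this card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "CriteriaPO/qwen2.5-3b-sft-10"  # SFT checkpoint this model was tuned from
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Any preference dataset with "prompt", "chosen", and "rejected" columns works;
# this particular dataset is a placeholder example.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="qwen2.5-3b-dpo",
    beta=0.1,                       # strength of the constraint to the reference model
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                 # TRL creates a frozen copy of `model` as the reference
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```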

Use Cases

This model is suitable for applications requiring a compact yet capable language model that has been optimized for preference alignment. It can be used for various text generation tasks where output quality and adherence to specific preferences are important.
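A minimal inference sketch using the standard transformers text-generation API is shown below; it assumes the tokenizer ships the Qwen2.5 chat template, and the generation settings are illustrative rather than recommended values.

```python
# Minimal inference sketch with the standard transformers API;
# generation parameters are illustrative defaults, not tuned values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CriteriaPO/qwen2.5-3b-dpo-coarse"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the weights are stored in BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```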