CriteriaPO/qwen2.5-3b-dpo-coarse

Text generation · Model size: 3.1B parameters · Quantization: BF16 · Context length: 32k · Published: May 4, 2025 · Architecture: Transformer

CriteriaPO/qwen2.5-3b-dpo-coarse is a 3.1 billion parameter language model fine-tuned from CriteriaPO/qwen2.5-3b-sft-10 using Direct Preference Optimization (DPO), which improves how closely its outputs align with human preferences. It is designed for general text generation tasks and builds on the Qwen2.5 architecture with a 32,768-token context length.


Model Overview

CriteriaPO/qwen2.5-3b-dpo-coarse is a 3.1 billion parameter language model developed by CriteriaPO. It is a fine-tuned variant of the CriteriaPO/qwen2.5-3b-sft-10 model, specifically trained using Direct Preference Optimization (DPO). This training methodology, detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," aims to align the model's outputs more closely with human preferences.
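The DPO objective from that paper optimizes the policy directly on preference pairs. For a dataset D of prompts x with chosen responses y_w and rejected responses y_l, and a frozen reference model π_ref (here, the SFT checkpoint), the loss is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here σ is the logistic sigmoid and β controls how far the policy may drift from the reference model while fitting the preference data.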

Key Capabilities

  • Preference Alignment: Enhanced through DPO training, making its responses more aligned with desired human feedback.
  • Text Generation: Capable of generating coherent and contextually relevant text based on prompts.
  • Qwen2.5 Architecture: Leverages the foundational capabilities of the Qwen2.5 model family.
  • Context Length: Supports a context window of 32,768 tokens, allowing longer inputs and more extended outputs.

Training Details

The model was trained using the TRL (Transformer Reinforcement Learning) framework, version 0.12.2. DPO optimizes the language model directly on preference pairs, treating the model itself as an implicit reward model; this removes the separately trained reward model and reinforcement learning loop used in classic RLHF, simplifying preference learning. The approach is effective for improving the quality and safety of generated text because it learns directly from preference data.
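As a minimal sketch of what the trainer computes internally (not the TRL implementation itself), the per-pair DPO loss can be written in pure Python from four summed log-probabilities. The function name and arguments are illustrative, not TRL API:

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of the chosen or
    rejected response under the trainable policy or the frozen
    reference (SFT) model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid of the margin (binary cross-entropy form).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy still matches the reference, the margin is zero and the loss is log 2; as training pushes the policy to prefer chosen responses more strongly than the reference does, the loss falls toward zero.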

Use Cases

This model is suitable for applications requiring a compact yet capable language model that has been optimized for preference alignment. It can be used for various text generation tasks where output quality and adherence to specific preferences are important.