chenyongxi/Qwen2.5-1.5B-SFT-DPO-InfinityPreference

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 3, 2026 · Architecture: Transformer

The chenyongxi/Qwen2.5-1.5B-SFT-DPO-InfinityPreference model is a 1.5-billion-parameter language model fine-tuned from the Qwen2.5 architecture. It was trained with Direct Preference Optimization (DPO) on the BAAI/Infinity-Preference dataset, specializing it in generating responses aligned with human preferences. With a 32,768-token context length, it offers a compact option for applications that call for nuanced, preference-tuned text generation.


Model Overview

The chenyongxi/Qwen2.5-1.5B-SFT-DPO-InfinityPreference is a 1.5 billion parameter language model based on the Qwen2.5 architecture. This model has been specifically fine-tuned using Direct Preference Optimization (DPO), a method designed to align language models with human preferences without the need for a separate reward model. The training utilized the BAAI/Infinity-Preference dataset, making it adept at generating responses that are preferred by humans.
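A minimal sketch of running the model with Hugging Face `transformers` (assuming the checkpoint is available on the Hub under the name above, and that `transformers` and `torch` are installed). Qwen2.5 models use the ChatML prompt format, which `apply_chat_template` produces automatically; the `build_chatml_prompt` helper below spells that format out purely for illustration.

```python
# Sketch: ChatML prompt formatting and inference for the DPO-tuned checkpoint.
# The model ID is taken from this card; everything else is an assumption.

MODEL_ID = "chenyongxi/Qwen2.5-1.5B-SFT-DPO-InfinityPreference"


def build_chatml_prompt(system: str, user: str) -> str:
    """Qwen2.5-style ChatML prompt, shown explicitly for illustration."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def generate(user_message: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and generate one assistant reply."""
    # Heavy imports kept inside the function so the prompt helper above
    # can be used without downloading the 1.5B checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```

In practice, prefer `tokenizer.apply_chat_template` over hand-built prompt strings, since it always matches the template the checkpoint was trained with.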

Key Capabilities

  • Preference-Aligned Generation: Excels at producing text outputs that are aligned with human preferences, thanks to its DPO training.
  • Compact Size: With 1.5 billion parameters, it offers a more efficient solution compared to larger models while still benefiting from preference tuning.
  • Extended Context Window: Supports a context length of 32768 tokens, allowing for processing and generating longer sequences of text.
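The 32,768-token window is shared between the prompt and the generated continuation, so long-document applications need to budget for both. A small sketch of that arithmetic (the 32,768 figure is from this card; actual token counts would come from the model's tokenizer):

```python
CONTEXT_LENGTH = 32_768  # from the model card

def max_prompt_tokens(max_new_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens left for the prompt after reserving room for generation."""
    if max_new_tokens >= context_length:
        raise ValueError("generation reservation exceeds the context window")
    return context_length - max_new_tokens

# Reserving 1,024 tokens for the reply leaves 31,744 tokens for the prompt.
```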

Training Details

The model was trained using the TRL library, leveraging the DPO method described in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". This approach optimizes the policy directly on pairs of preferred and dispreferred responses, increasing the likelihood of the preferred response relative to the dispreferred one, which makes the model effective for tasks where alignment with human feedback is crucial.
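Concretely, the per-pair DPO loss is −log σ(β[(log π_θ(y_w) − log π_ref(y_w)) − (log π_θ(y_l) − log π_ref(y_l))]), where y_w and y_l are the chosen and rejected responses and π_ref is the frozen reference model. A plain-Python sketch of that formula (inputs are summed sequence log-probabilities; β = 0.1 is a common default in TRL):

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss from summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) written stably as log(1 + e^{-x})
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference the loss is log 2; widening the margin on the chosen response drives it toward zero, which is exactly the preference-alignment pressure the training applies.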

Use Cases

This model is particularly well-suited for applications requiring:

  • Chatbots and Conversational AI: Generating more natural and preferred conversational responses.
  • Content Generation: Creating text that is more likely to be favored by users.
  • Preference-based Ranking: Tasks where outputs need to be ranked according to human-like preferences.
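For the ranking use case, one common recipe is to score each candidate response with the model (for example, its length-normalized log-likelihood of the candidate given the prompt) and sort. A sketch with a pluggable scorer, where the `score` callable stands in for a real model call and is purely illustrative:

```python
from typing import Callable, List


def rank_responses(
    prompt: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> List[str]:
    """Order candidate responses best-first by a preference score.

    `score(prompt, candidate)` is a stand-in for a real scoring call,
    e.g. the model's length-normalized log-likelihood of the candidate.
    """
    return sorted(candidates, key=lambda c: score(prompt, c), reverse=True)


# Toy scorer for demonstration only: prefer longer answers.
ranked = rank_responses("q", ["a", "bbb", "cc"], lambda p, c: len(c))
# ranked == ["bbb", "cc", "a"]
```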