Model Overview
chenyongxi/Qwen2.5-1.5B-DPO-1.5B is a 1.5 billion parameter language model, fine-tuned from an unspecified Qwen2.5 base model. It was trained using Direct Preference Optimization (DPO), a method that aligns a language model with human preferences directly from pairwise preference data, treating the policy itself as an implicit reward model instead of training a separate one. The training used the BAAI/Infinity-Preference dataset and the TRL (Transformer Reinforcement Learning) library.
Key Capabilities
- Preference Alignment: Optimized through DPO to generate responses that are aligned with human preferences, making it suitable for conversational AI and interactive applications.
- Text Generation: Capable of generating coherent and contextually relevant text based on given prompts.
- Long Context Understanding: Supports a context length of 32,768 tokens, allowing it to process long documents and extended multi-turn conversations.
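Since the capabilities above highlight conversational use, it helps to see how Qwen2.5-family chat models expect their input to be formatted. They use a ChatML-style template; in practice you would call the tokenizer's `apply_chat_template` method, but a minimal plain-Python sketch of the format looks like this (the exact template shipped with this checkpoint should be treated as authoritative):

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in the ChatML style
    used by Qwen2.5 chat models."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize DPO in one sentence."},
])
```

Using the tokenizer's own `apply_chat_template` rather than hand-building strings like this avoids drift if the checkpoint's template differs from the sketch.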
Training Details
The model's training procedure involved:
- Methodology: Direct Preference Optimization (DPO), as detailed in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290).
- Dataset: Fine-tuned on the BAAI/Infinity-Preference dataset.
- Frameworks: Developed using TRL (version 0.28.0.dev0), Transformers (version 4.56.2), PyTorch (version 2.8.0+cu128), Datasets (version 3.0.0), and Tokenizers (version 0.22.2).
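The DPO objective from the cited paper is compact enough to sketch directly: given total sequence log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model, the loss is the negative log-sigmoid of the scaled difference of log-ratios. A minimal pure-Python sketch (the beta value and log-probabilities below are illustrative, not taken from this model's training configuration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is a total sequence log-probability; beta controls how
    strongly the policy is penalized for drifting from the reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log1p(exp(-x)) for numerical stability
    return math.log1p(math.exp(-margin))

# When the policy favors the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2).
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0)
```

TRL's `DPOTrainer` computes this same quantity batched over token-level log-probabilities; the sketch only shows the scalar form of the objective.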
Good For
- Applications requiring preference-aligned text generation.
- Conversational agents and chatbots where response quality and human-likeness are important.
- General instruction-following tasks where preference tuning improves response helpfulness and tone.