kikiyaa/qwen-dpo-finetuned-ver2
kikiyaa/qwen-dpo-finetuned-ver2 is a 7.6-billion-parameter causal language model fine-tuned from Qwen/Qwen2.5-7B by kikiyaa using Direct Preference Optimization (DPO). It has a context length of 32768 tokens and targets general text generation, with preference-based training intended to make its responses more aligned and helpful.
Overview
kikiyaa/qwen-dpo-finetuned-ver2 is a 7.6-billion-parameter language model built on Qwen/Qwen2.5-7B. Developed by kikiyaa, it was further fine-tuned with Direct Preference Optimization (DPO), the method introduced in the paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (arXiv:2305.18290). DPO trains on pairs of preferred and rejected responses so that the model's outputs align more closely with human preferences.
Key Capabilities
- Preference-tuned Responses: DPO training steers outputs toward the responses preferred in the training data, which may make generated text more helpful.
- General Text Generation: Handles a range of text generation tasks with 7.6 billion parameters and a 32768-token context window.
- TRL Framework: Trained with TRL (Transformer Reinforcement Learning), a widely used Hugging Face library for post-training language models.
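As an illustration of the text generation use case, here is a minimal sketch of loading the model for plain prompt completion. It assumes the checkpoint is published on the Hugging Face Hub under the ID above and loads through the standard Transformers auto classes, which this card does not state explicitly. The `generate` helper and its default token budget are hypothetical names chosen for the example.

```python
MODEL_ID = "kikiyaa/qwen-dpo-finetuned-ver2"  # assumed to be the Hub repository ID


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Return the prompt followed by a model-generated continuation."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the raw prompt; no chat template is assumed here.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the first (and only) sequence in the batch.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Write a short note about preference tuning.")` downloads the weights on first use, so a machine with enough memory for a 7.6-billion-parameter model is needed.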
Training Details
Fine-tuning used DPO, which optimizes the language model directly on preference pairs. The model's log-probability ratio against a frozen reference model serves as an implicit reward, so no separate reward model has to be trained. Training used TRL 1.1.0, Transformers 5.5.4, and PyTorch 2.9.1+cu128.
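To make the DPO objective concrete, the sketch below computes the per-pair loss from the DPO paper using only the Python standard library. It is an illustration of the formula, not the training code used for this model. The function names are hypothetical, and `beta=0.1` is an illustrative value, since this card does not report the setting that was used.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative; not the value used for this model
) -> float:
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs."""
    # Implicit rewards are the policy-vs-reference log-ratios, scaled by beta.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # The loss falls as the policy favors the chosen response more than the reference does.
    return -log_sigmoid(margin)
```

When the policy matches the reference model, the margin is zero and the loss equals log 2. Raising the chosen response's log-probability relative to the rejected one pushes the loss below that value.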