arigedon/dpo-qwen-cot-merged
The arigedon/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-Instruct variant, fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is specifically optimized to enhance reasoning capabilities through Chain-of-Thought (CoT) and improve the quality of structured responses. This model is designed for tasks requiring aligned and coherent outputs, leveraging its 32768 token context length.
Loading preview...
Overview
This model, arigedon/dpo-qwen-cot-merged, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to align its responses with preferred outputs.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, aiming for more structured and logical outputs.
- Improved Response Quality: DPO training focuses on generating higher quality and more aligned responses.
- Direct Usage: Provided as a full-merged 16-bit model, eliminating the need for adapter loading and simplifying deployment with
transformers.
Training Details
The model underwent 2 epochs of DPO training with a learning rate of 5e-07 and a beta of 0.2, using a maximum sequence length of 2024. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset.
Licensing
This model operates under the MIT License, consistent with its training data. Users must also adhere to the original base model's license terms.