duong942001/dpo-qwen-cot-merged1
The duong942001/dpo-qwen-cot-merged1 is a 4 billion parameter Qwen3-based causal language model, fine-tuned by duong942001 using Direct Preference Optimization (DPO) with the Unsloth library. This model is specifically optimized for enhancing reasoning capabilities through Chain-of-Thought (CoT) and improving the quality of structured responses. It is designed to provide aligned and preferred outputs for complex reasoning tasks, leveraging its 40960 token context length.
Loading preview...
Model Overview
This model, duong942001/dpo-qwen-cot-merged1, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned by duong942001 using Direct Preference Optimization (DPO) via the Unsloth library. The fine-tuning process focused on aligning the model's responses with preferred outputs, specifically targeting improvements in reasoning (Chain-of-Thought) and the quality of structured responses.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, aiming for more logical and structured thought processes in its outputs.
- Improved Response Quality: Fine-tuned to produce higher quality, preferred outputs, particularly for structured response generation.
- Direct Preference Optimization (DPO): Utilizes DPO for alignment, leveraging a preference dataset to guide its learning.
- Merged Weights: Provided as a full-merged 16-bit model, eliminating the need for adapter loading and simplifying deployment with
transformers.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It was trained with a maximum sequence length of 1024, using a LoRA configuration (r=8, alpha=16) that has since been merged into the base model. The training data used was [u-10bei/dpo-dataset-qwen-cot].
Good For
- Applications requiring strong reasoning capabilities.
- Generating structured and high-quality responses based on user preferences.
- Developers looking for a ready-to-use, merged Qwen3-based model for inference without complex setup.