KazumaTsuboi/dpo-qwen-cot-merged
KazumaTsuboi/dpo-qwen-cot-merged is a 4 billion parameter causal language model, fine-tuned from Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to enhance reasoning capabilities through Chain-of-Thought (CoT) and improve the quality of structured responses. It is designed for tasks requiring improved logical progression and coherent, well-formatted outputs.
Loading preview...
Model Overview
KazumaTsuboi/dpo-qwen-cot-merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA adapters merged into the base model for direct use.
Key Optimizations
This model's primary optimization focused on improving two critical areas:
- Reasoning (Chain-of-Thought): Enhanced ability to generate logical, step-by-step thought processes.
- Structured Response Quality: Improved coherence and formatting of outputs based on preference datasets.
Training Details
The DPO training was conducted for 1 epoch with a learning rate of 1e-06 and a beta value of 0.05. The maximum sequence length used during training was 1024 tokens. The model is provided as full-merged 16-bit weights, eliminating the need for separate adapter loading.
Intended Use
This model is suitable for applications where robust reasoning and high-quality, structured outputs are paramount, particularly in tasks benefiting from Chain-of-Thought prompting.