kazuyamaa/dpo-qwen-cot-merged
The kazuyamaa/dpo-qwen-cot-merged model is a 4-billion-parameter, Qwen3-based, instruction-tuned causal language model, fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality, making it suitable for applications that require sound logical reasoning and coherent, well-organized outputs.
Model Overview
The kazuyamaa/dpo-qwen-cot-merged model is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned with Direct Preference Optimization (DPO), using the Unsloth library, to better align its outputs with preference data.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
- Structured Response Quality: Focuses on generating higher quality, more coherent, and structured outputs based on preference datasets.
- Full-Merged Weights: The repository provides full-merged 16-bit weights, eliminating the need for adapter loading during deployment.
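Because the repository ships full-merged 16-bit weights, the model can be loaded directly with the Hugging Face transformers library, with no adapter step. The sketch below is a minimal, illustrative example (the prompt and generation settings are assumptions, not part of the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kazuyamaa/dpo-qwen-cot-merged"

# Example CoT-style prompt; the wording is illustrative only.
messages = [
    {
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. "
                   "What is its average speed? Think step by step.",
    }
]

def generate(prompt_messages, max_new_tokens=512):
    """Load the merged model and run one chat-formatted generation."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Apply the model's chat template and generate a completion.
    input_ids = tokenizer.apply_chat_template(
        prompt_messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate(messages))
```

Since no adapter weights are involved, the same ID works unchanged with vLLM or other loaders that accept standard Hugging Face checkpoints.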
Training Details
The model underwent DPO training for one epoch with a learning rate of 1e-6, a DPO beta of 0.1, and a maximum sequence length of 2048 tokens. Training used the u-10bei/dpo-dataset-qwen-cot dataset, which is licensed under the MIT License; users must also adhere to the base model's license terms.
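The reported hyperparameters map directly onto a TRL-style DPO configuration, which is what Unsloth drives under the hood. A hedged sketch, using only the values stated above (all other settings, including the output directory, are assumptions):

```python
from trl import DPOConfig

# Values stated in the model card: 1 epoch, lr 1e-6, beta 0.1, max length 2048.
# Everything else (batch size, output_dir, etc.) is an illustrative assumption.
config = DPOConfig(
    output_dir="dpo-qwen-cot",   # hypothetical path
    num_train_epochs=1,
    learning_rate=1e-6,
    beta=0.1,
    max_length=2048,
)
```

This config would then be passed to TRL's DPOTrainer together with the base model, tokenizer, and the preference dataset; the exact trainer setup used by the author is not documented in the card.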
When to Use This Model
This model is particularly well-suited for applications where improved reasoning, logical consistency, and structured output generation are critical. Its DPO fine-tuning makes it a strong candidate for tasks requiring high-quality, aligned responses.