Overview
kennaka1112/dpo-qwen-cot-merged is a 4-billion-parameter model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its LoRA adapters were merged into the base model for direct use. Training focused on aligning the model's responses with preferred outputs, specifically targeting improvements in reasoning and structured response generation.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Improved Structured Responses: Designed to produce higher quality and more structured outputs based on preference datasets.
- Direct Use: Provided as fully merged 16-bit weights, eliminating the need to load LoRA adapters and simplifying deployment with transformers.
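Because the adapters are merged, the checkpoint loads like any other causal LM. A minimal sketch, assuming the repository name above resolves on the Hugging Face Hub; the system prompt and generation parameters are illustrative choices, not recommendations from the model authors:

```python
MODEL_ID = "kennaka1112/dpo-qwen-cot-merged"

def build_chat(question: str) -> list:
    # Standard chat-format messages; the system prompt here is an
    # illustrative assumption, not part of the model card.
    return [
        {"role": "system", "content": "You are a helpful assistant. Reason step by step."},
        {"role": "user", "content": question},
    ]

def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    # Requires `transformers` and `torch`; imported inside the function so
    # the pure-Python helper above stays usable without them.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_chat(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

For example, `generate_answer("What is 17 * 24?")` returns the model's step-by-step reply as a string.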
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-05 and a beta value of 0.2. A maximum sequence length of 1024 was used during training. The base model's license terms (MIT License) apply to this fine-tuned version.
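The stated recipe can be summarized in code. The hyperparameter values below are the ones from the card; the `to_dpo_example` helper and its field names are illustrative assumptions about how preference pairs are typically formatted for a DPO trainer (e.g. TRL's `DPOTrainer`, which Unsloth wraps), not the author's actual training script:

```python
# Hyperparameters stated in the model card.
DPO_CONFIG = {
    "num_train_epochs": 1,
    "learning_rate": 1e-5,
    "beta": 0.2,            # DPO temperature: higher beta keeps the policy closer to the reference model
    "max_seq_length": 1024, # maximum sequence length used during training
}

def to_dpo_example(prompt: str, chosen: str, rejected: str) -> dict:
    # DPO trains on preference pairs: for each prompt, a preferred
    # ("chosen") and a dispreferred ("rejected") completion. These field
    # names follow the common TRL convention (an assumption here).
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

A dataset of such dicts, together with these hyperparameters, is what a DPO trainer consumes to nudge the model toward the "chosen" responses.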
Good For
- Applications requiring strong logical reasoning and Chain-of-Thought capabilities.
- Scenarios where structured and high-quality responses are critical.
- Developers looking for a readily deployable Qwen3-based model with enhanced reasoning.