Overview
This model, sho-nakamura/dpo-qwen-cot-merged, is a 4-billion-parameter variant of the Qwen3 architecture developed by sho-nakamura. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, building on the sho-nakamura/qwen3-4b-instruct-sft-lora-structured base model. DPO training aligned the model's responses with preferred outputs, specifically strengthening its Chain-of-Thought reasoning and structured response generation.
Key Capabilities
- Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) prompting to improve logical deduction.
- Structured Output: Generates responses in consistent, structured formats, a behavior reinforced by the preference data used during DPO training.
- Full-Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading.
- Qwen3 Base: Leverages the robust foundation of the Qwen3-4B-Instruct model.
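Since the weights are fully merged, the model can be prompted like any Qwen3 chat model. Qwen models use a ChatML-style template with `<|im_start|>` / `<|im_end|>` turn markers; in practice you would call `tokenizer.apply_chat_template` from the transformers library, but the hand-rolled sketch below illustrates the prompt layout (the message contents are illustrative, not from the model card):

```python
# Minimal sketch of the ChatML-style prompt format used by Qwen models.
# For real inference, prefer tokenizer.apply_chat_template; this version
# only shows what the rendered prompt looks like.

def format_chatml(messages):
    """Render a list of {role, content} dicts into a ChatML-style prompt."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant. Think step by step."},
    {"role": "user", "content": "If a train travels 60 km in 40 minutes, what is its speed in km/h?"},
])
print(prompt)
```

Asking the model to "think step by step" in the system or user turn plays to the CoT behavior the DPO training reinforced.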
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-7 and a beta value of 0.1, using a maximum sequence length of 1024 tokens. The LoRA adapters (r=8, alpha=16) were then merged into the base model to produce the full 16-bit weights.
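To make the role of beta concrete: the standard per-example DPO objective is -log σ(β · margin), where the margin is the difference in policy-vs-reference log-ratios between the chosen and rejected responses. A minimal sketch (the log-probability values below are illustrative, not from this model's training):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the policy (vs. the frozen reference model)
    prefers the chosen response over the rejected one."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = chosen_ratio - rejected_ratio
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With beta=0.1 (the value used here), a larger margin in favour of the
# chosen response yields a smaller loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # margin = 2.0
```

A small beta such as 0.1 softens the preference signal, keeping the policy close to the reference model while still pushing it toward the chosen responses.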
Good For
- Applications requiring strong reasoning abilities.
- Tasks where structured and formatted outputs are crucial.
- Developers looking for a Qwen3-based model with improved CoT and structured response generation.