Nao-Taka/dpo-qwen-cot-merged is a 4-billion-parameter, Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO). It specializes in Chain-of-Thought (CoT) reasoning and structured response generation, supports a 40,960-token context window, and is optimized for tasks requiring coherent, logically organized output.
Overview
Nao-Taka/dpo-qwen-cot-merged is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged, eliminating the need for adapter loading.
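Because the 16-bit weights are fully merged, the model loads like any standard Hugging Face checkpoint, with no PEFT/adapter step. A minimal inference sketch, assuming the `transformers` library is installed (the helper name and generation settings below are illustrative, not part of the card):

```python
def generate_response(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a response from the merged checkpoint; the weights load
    directly, with no separate LoRA adapter to attach."""
    # Heavy imports kept inside the function so the sketch stays lightweight.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Nao-Taka/dpo-qwen-cot-merged"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    # Qwen3 instruct models expect chat-formatted input.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Calling `generate_response("Walk through 17 * 24 step by step.")` should elicit the model's CoT-style reasoning before the final answer.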
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured outputs.
- Preference Alignment: Uses DPO to align responses with preferred outputs from the u-10bei/dpo-dataset-qwen-cot preference dataset.
- Structured Response Quality: Produces well-organized, consistently formatted responses.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 5e-06 and a beta value of 0.1. It was trained with a maximum sequence length of 1024, using a LoRA configuration (r=8, alpha=16) that has since been merged into the base model. The training data used for DPO was u-10bei/dpo-dataset-qwen-cot.
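The hyperparameters above can be collected into a training sketch. This is a hypothetical reconstruction assuming Unsloth together with TRL's `DPOTrainer`; exact argument names vary across TRL versions, and the dataset split is an assumption not stated on the card:

```python
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import DPOConfig, DPOTrainer

# Base model, capped at the training-time sequence length reported on the card.
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", max_seq_length=1024
)
# LoRA configuration (r=8, alpha=16) that was later merged into the base weights.
model = FastLanguageModel.get_peft_model(model, r=8, lora_alpha=16)

args = DPOConfig(
    num_train_epochs=1,
    learning_rate=5e-6,
    beta=0.1,          # DPO preference temperature
    max_length=1024,
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train"),
    processing_class=tokenizer,
)
trainer.train()
# Merge the 16-bit LoRA weights and save a standalone checkpoint.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit")
```

The final merge step is what makes the published checkpoint loadable without any adapter files.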
Licensing
This model is released under the MIT License, following the license of its training dataset. Users must also comply with the license terms of the base model, Qwen/Qwen3-4B-Instruct-2507.