TaHiTaHiTa/dpo-qwen-cot-merged
TaHiTaHiTa/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model, fine-tuned using Direct Preference Optimization (DPO) by TaHiTaHiTa. It is specifically optimized to improve reasoning capabilities through Chain-of-Thought (CoT) and enhance structured response quality. This model is designed for tasks requiring improved logical coherence and adherence to preferred output formats.
Loading preview...
Model Overview
TaHiTaHiTa/dpo-qwen-cot-merged is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent outputs.
- Structured Response Quality: Aligned with preferred outputs to produce higher quality, structured responses.
- Efficient Fine-tuning: Utilizes DPO with a specific training configuration (1 epoch, 5e-06 learning rate, beta 0.1, max sequence length 768) on a specialized preference dataset.
Training Details
The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset. The training focused on aligning the model's responses with human preferences, particularly for reasoning and structured output tasks. The LoRA configuration (r=8, alpha=16) was merged into the base model during the fine-tuning process.
Licensing
This model operates under the MIT License, as per the terms of its training dataset. Users are also required to comply with the original base model's license terms.