KawausoHiroKawauso/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized for improved reasoning, particularly Chain-of-Thought (CoT), and for generating high-quality structured responses. Training against a preference dataset aligns its outputs with desired formats and logical flows, making it suitable for tasks that require structured, reasoned answers.
Model Overview
This model, KawausoHiroKawauso/dpo-qwen-cot-merged, is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and the fine-tuned weights have been fully merged into the base model at 16-bit precision, so no adapter loading is required.
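Because the weights are merged, the checkpoint loads like any standalone model with `transformers` — no PEFT or adapter step. A minimal sketch (the repository id comes from this card; the dtype and device settings are illustrative defaults):

```python
MODEL_ID = "KawausoHiroKawauso/dpo-qwen-cot-merged"

def load_model():
    """Load the merged checkpoint directly; no adapter attachment is needed."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",  # pick up the merged 16-bit weights as stored
        device_map="auto",   # place layers on the available GPU(s)/CPU
    )
    return tokenizer, model

if __name__ == "__main__":
    tokenizer, model = load_model()
    messages = [{"role": "user", "content": "Explain step by step: what is 17 * 24?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```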
Key Optimizations
The primary objective of this DPO fine-tuning was to enhance the model's ability to generate improved reasoning (Chain-of-Thought) and produce high-quality structured responses. This was achieved by aligning the model's outputs with a specific preference dataset (u-10bei/dpo-dataset-qwen-cot).
Training Configuration
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-05
- Max Sequence Length: 1024
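The configuration above corresponds to a standard DPO run. A hedged sketch using TRL's `DPOTrainer` (hyperparameters and dataset id are taken from this card; batch size and other settings are illustrative, and the Unsloth-specific setup is omitted):

```python
BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"   # base model, from this card
DATASET = "u-10bei/dpo-dataset-qwen-cot"     # preference dataset, from this card

def build_trainer():
    """Assemble a DPO trainer mirroring the configuration listed above."""
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    dataset = load_dataset(DATASET, split="train")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="auto")

    args = DPOConfig(
        num_train_epochs=1,    # epochs, from this card
        learning_rate=1e-5,    # learning rate, from this card
        max_length=1024,       # max sequence length, from this card
        output_dir="dpo-qwen-cot",  # illustrative output path
    )
    return DPOTrainer(
        model=model,
        args=args,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
```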
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Enhanced Reasoning: Tasks that benefit from explicit, step-by-step logical deductions.
- Structured Output Generation: Scenarios where responses need to adhere to specific formats or structures.
- Preference Alignment: Use cases where model outputs should closely match human-preferred examples for quality and coherence.
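As an illustration of the structured-output use case, a prompt can pin down the expected response shape explicitly. The schema and wording below are illustrative, not a format the model is guaranteed to follow:

```python
import json

def build_structured_prompt(question: str) -> list:
    """Build a chat message list asking for step-by-step reasoning plus a
    final answer in a fixed JSON shape. The schema here is illustrative."""
    schema = {"steps": ["<step 1>", "<step 2>"], "answer": "<final answer>"}
    system = (
        "Reason step by step, then reply ONLY with JSON matching this shape:\n"
        + json.dumps(schema, indent=2)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_structured_prompt("What is 17 * 24?")
```

The resulting `messages` list can be passed to the tokenizer's chat template as usual; the DPO alignment toward structured responses makes the model more likely to honor such format instructions.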