The sokosokobe/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO) to strengthen Chain-of-Thought (CoT) reasoning and structured response quality. It is designed for tasks that require clear logical progression and coherent, well-organized outputs.
## Model Overview
The sokosokobe/dpo-qwen-cot-merged model is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full 16-bit weights merged for direct use without adapters.
## Key Optimizations
This model's primary optimization focuses on enhancing:
- Reasoning (Chain-of-Thought): Improved ability to generate logical, step-by-step reasoning processes.
- Structured Response Quality: Better coherence and organization in generated outputs, aligning with preferred response formats.
## Training Details
The DPO fine-tuning process involved:
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-07
- Max Sequence Length: 1024
- Training Data: the u-10bei/dpo-dataset-qwen-cot dataset, used for preference alignment.
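The recipe above can be sketched with TRL's DPOTrainer. This is an assumption-laden outline, not the card's actual training script: the card says Unsloth was used, and details such as the dataset split name and output directory below are placeholders.

```python
# Hedged sketch of the DPO recipe listed above, written with plain TRL
# rather than the Unsloth wrapper the card mentions (an assumption).
# The hyperparameters mirror the card: 1 epoch, lr 1e-07, max length 1024.

dpo_hyperparams = {
    "num_train_epochs": 1,
    "learning_rate": 1e-7,
    "max_length": 1024,
}

def build_trainer(base_model="Qwen/Qwen3-4B-Instruct-2507",
                  dataset_name="u-10bei/dpo-dataset-qwen-cot"):
    # Imports are deferred so the hyperparameter dict above is usable
    # even without trl/transformers installed.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model = AutoModelForCausalLM.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # The split name "train" is a placeholder assumption.
    train_dataset = load_dataset(dataset_name, split="train")

    config = DPOConfig(output_dir="dpo-qwen-cot", **dpo_hyperparams)
    return DPOTrainer(model=model, args=config,
                      train_dataset=train_dataset,
                      processing_class=tokenizer)

if __name__ == "__main__":
    build_trainer().train()
```

After training, merging the adapter weights back into the base model (as was done here) yields the standalone 16-bit checkpoint that can be loaded without PEFT adapters.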
## Usage
Because the weights are merged, the model can be loaded directly with the transformers library for inference, with no adapter files required. It is released under the MIT License; use must also comply with the original base model's license terms.
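A minimal inference sketch with transformers is shown below. It follows the standard chat-template pattern for Qwen3 instruct models; the prompt text and generation settings are illustrative assumptions, not part of the card.

```python
# Minimal inference sketch for the merged model via transformers.
# The "think step by step" phrasing is an illustrative CoT-style prompt.

def build_messages(question):
    # Chat-format message list; nudges the model toward step-by-step reasoning.
    return [{"role": "user",
             "content": f"{question} Think step by step before answering."}]

def generate(question, model_id="sokosokobe/dpo-qwen-cot-merged"):
    # Imports deferred so build_messages works without torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")

    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:],
                            skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("If a train travels 60 km in 45 minutes, "
                   "what is its average speed in km/h?"))
```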