okap014/dpo-qwen-cot-merged
The okap014/dpo-qwen-cot-merged model is a 4-billion-parameter, Qwen3-based instruction-tuned language model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses. The model is designed for tasks requiring aligned, high-quality outputs, particularly reasoning and structured generation.
Model Overview
okap014/dpo-qwen-cot-merged is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library, and the result was merged into 16-bit weights, so no adapter loading is required.
Key Capabilities
- Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning abilities.
- Structured Response Quality: Focuses on generating higher quality and more structured outputs based on preference datasets.
- Direct Use: As a fully merged model, it can be loaded directly with the `transformers` library for inference (see the sketch below).
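A minimal inference sketch, assuming the standard `transformers` chat workflow; the prompt, dtype, and generation settings below are illustrative assumptions, not part of this card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "okap014/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # merged 16-bit weights; bf16 is an assumption
    device_map="auto",
)

# Build a chat prompt with the model's own template (Qwen3 convention).
messages = [{"role": "user", "content": "Explain step by step why 17 is prime."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```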
Training Details
The model was trained for one epoch with a learning rate of 1e-7, a DPO beta of 0.1, and a maximum sequence length of 1024, using the u-10bei/dpo-dataset-qwen-cot preference dataset to align its responses with preferred outputs. The model is released under the MIT License; users must also comply with the original base model's license terms.
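For orientation, here is a hedged sketch of how such a run could be reproduced with Unsloth and TRL's DPOTrainer. The hyperparameters (beta 0.1, learning rate 1e-7, one epoch, max length 1024) and the dataset name come from this card; everything else (LoRA settings, batch size, output paths) is an assumption, not the author's exact recipe:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Load the base model through Unsloth with the card's sequence length.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)

# Attach LoRA adapters for training; rank and target modules are assumptions.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Preference dataset named in this card (prompt / chosen / rejected pairs).
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,            # from this card
        learning_rate=1e-7,  # from this card
        num_train_epochs=1,  # from this card
        max_length=1024,     # from this card
        output_dir="dpo-qwen-cot",  # assumed path
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapters into 16-bit weights, matching the released artifact
# that requires no adapter loading.
model.save_pretrained_merged(
    "dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit"
)
```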
Ideal Use Cases
This model is particularly well suited to applications where improved reasoning, coherent thought processes, and high-quality, structured text generation are critical. Its DPO fine-tuning makes it a good fit for tasks that demand aligned, preference-tuned response styles.