ShimadaMasatsugu/dpo-qwen-cot-merged
ShimadaMasatsugu/dpo-qwen-cot-merged is a fine-tuned Qwen3-4B-Instruct-2507 model, optimized using Direct Preference Optimization (DPO) via Unsloth. This model focuses on enhancing reasoning capabilities through Chain-of-Thought (CoT) and improving structured response quality. It is designed for applications requiring precise, aligned outputs, particularly in reasoning tasks.
Loading preview...
Model Overview
ShimadaMasatsugu/dpo-qwen-cot-merged is a specialized language model derived from the Qwen3-4B-Instruct-2507 base model. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, with its 16-bit weights fully merged into the base model, eliminating the need for adapter loading.
Key Capabilities & Optimization
This model's primary optimization objective was to align its responses with preferred outputs, specifically targeting:
- Improved Reasoning: Enhanced Chain-of-Thought (CoT) capabilities.
- Structured Response Quality: Better generation of structured outputs based on a preference dataset.
Training Details
The DPO fine-tuning process involved:
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-07
- Max Sequence Length: 1024
- Training Data: Utilized the u-10bei/dpo-dataset-qwen-cot dataset.
Usage & Licensing
As a merged model, it can be directly used with the transformers library. The model is released under the MIT License, consistent with its training data, and users must also adhere to the original base model's license terms.