yumiyumi/dpo-qwen-cot-merged
The yumiyumi/dpo-qwen-cot-merged model is a 4-billion-parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) with the Unsloth library. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality, making it suited to applications that require logical coherence and preference-aligned output.
Model Overview
The yumiyumi/dpo-qwen-cot-merged model is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights were fully merged into the base model, so no adapter loading is required.
Key Capabilities & Optimization
- Enhanced Reasoning: The model's primary optimization objective was to improve its reasoning abilities, particularly through Chain-of-Thought (CoT) processes.
- Structured Response Quality: DPO training focused on aligning the model's outputs with preferred responses, leading to better structured and higher-quality generations.
- Efficient Fine-tuning: DPO was run with the Unsloth library for a single epoch at a low learning rate (1e-07), a targeted, light-touch alignment pass rather than a full retrain.
- Direct Usage: As a fully merged model, it can be used directly with the `transformers` library, with no additional configuration for LoRA adapters.
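Because the DPO weights are merged, the model loads like any standard causal LM. The sketch below is a minimal, hedged usage example with `transformers` (not taken from the card itself); the function name `generate_answer` and the generation settings are illustrative.

```python
MODEL_ID = "yumiyumi/dpo-qwen-cot-merged"

def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the merged model and generate a chat-formatted answer.

    Imports are deferred because this needs the model weights and,
    realistically, a GPU.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )

    # No PEFT/LoRA setup is needed: the DPO weights are already merged.
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

A CoT-style prompt such as `generate_answer("A train travels 60 km in 45 minutes; what is its average speed in km/h? Think step by step.")` plays to the model's stated optimization.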
Training Details
The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset, chosen for preference alignment on reasoning tasks. Training used a maximum sequence length of 1024 tokens and a DPO beta of 0.1. The base model's license terms apply to the merged weights, and the training data is released under the MIT License.
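The actual run used Unsloth, but the reported hyperparameters map directly onto a standard DPO setup. Below is a minimal sketch using TRL's `DPOTrainer` as a stand-in; the output directory names are illustrative, and only the hyperparameter values (beta 0.1, learning rate 1e-07, one epoch, 1024-token sequences) come from the card.

```python
# Hyperparameters stated in the model card.
BETA = 0.1            # DPO temperature: how far the policy may drift from the reference
LEARNING_RATE = 1e-7  # very low LR -> gentle, targeted preference alignment
NUM_EPOCHS = 1
MAX_SEQ_LEN = 1024

DATASET_ID = "u-10bei/dpo-dataset-qwen-cot"
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"

def train():
    # Deferred imports: running this needs a GPU plus datasets/transformers/trl.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    dataset = load_dataset(DATASET_ID, split="train")

    args = DPOConfig(
        output_dir="dpo-qwen-cot",      # illustrative path
        beta=BETA,
        learning_rate=LEARNING_RATE,
        num_train_epochs=NUM_EPOCHS,
        max_length=MAX_SEQ_LEN,
    )
    # With no explicit ref_model, DPOTrainer uses a frozen copy of the
    # policy as the reference model for the preference loss.
    trainer = DPOTrainer(
        model=model,
        args=args,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model("dpo-qwen-cot-merged")  # illustrative path
```

The very low learning rate paired with a single epoch suggests the goal was to nudge output preferences without disturbing the base model's instruction-following ability.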
Good For
- Applications requiring improved logical reasoning and step-by-step thought processes.
- Scenarios where structured and high-quality responses are critical.
- Developers looking for a Qwen3-4B variant with enhanced preference alignment for specific output styles.