dormouse2/dpo-qwen-cot-merged: DPO-Optimized Qwen3-4B for Enhanced Reasoning
This model is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model, fine-tuned by dormouse2 using Direct Preference Optimization (DPO) via the Unsloth library. The fine-tuning aligns the model's responses with preferred outputs, strengthening its Chain-of-Thought (CoT) reasoning and the overall quality of its structured responses.
Key Capabilities & Features
- Enhanced Reasoning: Specifically optimized for Chain-of-Thought (CoT) reasoning, making it suitable for complex problem-solving.
- Improved Response Quality: DPO fine-tuning aligns outputs with preferred examples, leading to more coherent and structured answers.
- Full-Merged Weights: The repository provides the full-merged 16-bit weights, so the model loads directly without any adapter step (see the loading sketch after this list).
- Efficient Training: Fine-tuned with DPO for a single epoch at a learning rate of 1e-07, a beta of 0.1, and a maximum sequence length of 1024 (a reproduction sketch appears at the end of this card).
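Because the weights are fully merged, the model loads like any standard Hugging Face checkpoint. Below is a minimal loading and inference sketch, assuming the transformers and accelerate libraries; the step-by-step prompt is an illustrative example, not taken from this card.

```python
# Minimal loading/inference sketch; assumes transformers and accelerate are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dormouse2/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the full-merged 16-bit weights
    device_map="auto",           # requires accelerate
)

# Chat-style prompt nudging the model toward step-by-step (CoT) reasoning.
messages = [
    {"role": "user",
     "content": "A train covers 120 km in 1.5 hours. What is its average speed? Think step by step."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```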
Good For
- Applications requiring strong reasoning and logical deduction.
- Generating structured and high-quality responses based on preference data.
- Tasks where alignment with specific output styles is crucial.
The DPO training used the u-10bei/dpo-dataset-qwen-cot dataset. The model is released under the MIT License; users must also comply with the original base model's license terms.
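The card reports only the headline hyperparameters, so the following is a hypothetical reproduction sketch using the standard Unsloth + TRL DPO workflow. The learning rate (1e-07), beta (0.1), epoch count (1), sequence length (1024), base model, and dataset come from this card; the LoRA configuration, batch size, and output paths are illustrative assumptions.

```python
# Hypothetical DPO training sketch (Unsloth + TRL). Values marked "stated"
# come from this card; everything else is an assumption.
from unsloth import FastLanguageModel, PatchDPOTrainer
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

PatchDPOTrainer()  # patch TRL's DPOTrainer to use Unsloth's fast kernels

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",  # stated base model
    max_seq_length=1024,            # stated
    load_in_4bit=True,              # assumption: typical Unsloth memory setting
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                           # assumption: LoRA rank is not documented
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,                  # assumption
)

# Stated preference dataset; DPOTrainer expects prompt/chosen/rejected columns.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,                       # stated
        learning_rate=1e-7,             # stated
        num_train_epochs=1,             # stated
        max_length=1024,                # stated
        per_device_train_batch_size=2,  # assumption
        output_dir="dpo-qwen-cot",      # assumption
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapter into 16-bit weights, matching the release format.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```

In DPO, beta controls how far the fine-tuned policy may drift from the reference model, so the stated beta of 0.1 combined with the very low learning rate suggests a deliberately conservative alignment run.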