moushi21/dpo-qwen-cot-merged2 is a 4-billion-parameter language model fine-tuned from unsloth/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). The model targets improved reasoning, particularly Chain-of-Thought (CoT), and structured response generation. It is designed for tasks that require clear logical progression and high-quality, aligned outputs, making it suitable for applications where precise and coherent reasoning is critical.
Overview
moushi21/dpo-qwen-cot-merged2 is a 4-billion-parameter language model derived from unsloth/Qwen3-4B-Instruct-2507. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its LoRA adapters merged into the base model, so it can be used directly without loading adapters separately.
Key Capabilities
- Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Structured Response Quality: Focuses on generating higher quality and more aligned outputs based on preference data.
- Direct Usage: Provided as a fully merged 16-bit model, allowing straightforward integration with the `transformers` library.
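Because the adapters are already merged, the model loads like any standard causal LM. The sketch below shows typical `transformers` usage; the prompt and generation settings are illustrative examples, not recommended values from the model authors.

```python
# Minimal usage sketch: load the merged model and run a chat-style generation.
# Generation parameters here are illustrative assumptions, not official settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moushi21/dpo-qwen-cot-merged2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # merged 16-bit weights
    device_map="auto",
)

# Build a chat prompt using the tokenizer's chat template.
messages = [
    {"role": "user", "content": "If a train travels 120 km in 2 hours, what is its average speed?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the DPO tuning targets CoT reasoning, multi-step prompts like the one above are a natural fit; no adapter-loading code (e.g. PEFT) is needed.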
Good For
- Applications requiring improved logical reasoning and problem-solving.
- Generating structured and coherent text outputs.
- Tasks where alignment with preferred response styles is crucial.
- Developers seeking a 4B parameter model with enhanced CoT capabilities for efficient deployment.