moushi21/dpo-qwen-cot-merged2
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 14, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

moushi21/dpo-qwen-cot-merged2 is a 4-billion-parameter language model fine-tuned from unsloth/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It targets improved reasoning, particularly Chain-of-Thought (CoT), and structured response generation, making it suitable for applications where precise, coherent reasoning is critical.


Overview

moushi21/dpo-qwen-cot-merged2 is a 4-billion-parameter language model derived from unsloth/Qwen3-4B-Instruct-2507. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its LoRA adapters have been merged into the base weights, so the model loads directly without a separate adapter-loading step.

Key Capabilities

  • Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
  • Structured Response Quality: Focuses on generating higher quality and more aligned outputs based on preference data.
  • Direct Usage: Provided as a fully merged 16-bit (BF16) model, allowing straightforward integration with the transformers library.
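Because the adapters are already merged, the model can be loaded like any standard causal LM. The sketch below shows one way to do this with the transformers library; the system prompt wording and generation settings are illustrative assumptions, not part of the model card.

```python
import os

# Model ID from the card; the merged BF16 weights are loaded like any base model.
MODEL_ID = "moushi21/dpo-qwen-cot-merged2"


def build_cot_messages(question: str) -> list[dict]:
    """Wrap a question in a chat-format message list that nudges
    step-by-step (Chain-of-Thought) reasoning. The system prompt text
    here is an illustrative assumption, not prescribed by the model."""
    return [
        {"role": "system", "content": "Reason step by step before giving a final answer."},
        {"role": "user", "content": question},
    ]


# Guarded so the sketch can be read/imported without triggering the
# multi-gigabyte weight download; set RUN_GENERATION=1 to actually run it.
if os.environ.get("RUN_GENERATION") == "1":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="bfloat16",   # matches the published BF16 quantization
        device_map="auto",        # place layers on available GPU(s)/CPU
    )

    inputs = tokenizer.apply_chat_template(
        build_cot_messages("A train travels 120 km in 1.5 hours. What is its average speed?"),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the weights are merged, no PEFT/LoRA loading step is needed; the same code works for any instruct-style checkpoint with a chat template.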

Good For

  • Applications requiring improved logical reasoning and problem-solving.
  • Generating structured and coherent text outputs.
  • Tasks where alignment with preferred response styles is crucial.
  • Developers seeking a 4B parameter model with enhanced CoT capabilities for efficient deployment.