shingo2211/dpo-qwen-cot-merged

Task: Text Generation · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 5, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

shingo2211/dpo-qwen-cot-merged is a 4-billion-parameter Qwen3-Instruct model fine-tuned with Direct Preference Optimization (DPO) to improve Chain-of-Thought reasoning and the quality of structured responses. Developed by shingo2211, it was trained using the Unsloth library and is intended for applications where logical flow and coherent, well-structured answers matter.


Model Overview

shingo2211/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improvements in reasoning capability and the quality of structured responses. The DPO process aligned the model's outputs with preferred examples from the u-10bei/dpo-dataset-qwen-cot dataset, helping it generate more logical and coherent answers.
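For readers who want to reproduce a similar fine-tune, the training setup described above can be sketched as follows. This is an illustrative outline, not the author's exact recipe: the card states Unsloth was used (Unsloth wraps trl's `DPOTrainer`), while this sketch uses plain trl and assumes a recent version where the tokenizer is passed as `processing_class`. All hyperparameters are placeholder assumptions.

```python
# Hedged sketch of a DPO fine-tune in the spirit of this model card.
# Base model and dataset names come from the card; hyperparameters do not.
BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"
DPO_DATASET = "u-10bei/dpo-dataset-qwen-cot"

# Illustrative training settings (assumptions, not the published values).
TRAIN_ARGS = dict(
    beta=0.1,                         # DPO temperature: strength of preference fit
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

if __name__ == "__main__":
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="bfloat16")
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    # DPO expects "prompt", "chosen", and "rejected" columns.
    dataset = load_dataset(DPO_DATASET, split="train")

    trainer = DPOTrainer(
        model=model,
        args=DPOConfig(output_dir="dpo-qwen-cot", **TRAIN_ARGS),
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()
    # Saving the trained model yields merged full weights (no separate adapter).
    trainer.save_model("dpo-qwen-cot-merged")
```

The heavy training code is guarded under `__main__` so the constants above can be inspected or reused without triggering a download.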

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): Optimized to produce more structured and logical thought processes in its responses.
  • Improved Response Quality: Fine-tuned to generate preferred and higher-quality outputs, particularly for structured tasks.
  • Direct Use: Provided as a fully merged 16-bit model, eliminating the need for adapter loading and allowing direct integration with transformers.

Good For

  • Applications requiring models with strong reasoning and logical flow.
  • Use cases where structured and high-quality textual outputs are critical.
  • Developers seeking a Qwen3-based model with enhanced alignment to human preferences for response generation.