harutoshi/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Feb 3, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The harutoshi/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-4B-Instruct-2507 variant, fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library. This model specializes in enhancing reasoning capabilities through Chain-of-Thought (CoT) and improving the quality of structured responses. It is optimized for tasks requiring aligned and preferred outputs, making it suitable for applications where response coherence and logical flow are critical.

Loading preview...

Model Overview

This model, harutoshi/dpo-qwen-cot-merged, is a 4 billion parameter language model based on the Qwen/Qwen3-4B-Instruct-2507 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and structured responses.
  • Structured Output Quality: Focuses on generating higher quality, aligned outputs based on preference datasets.
  • Direct Use: Provided as full-merged 16-bit weights, eliminating the need for adapter loading and allowing direct use with transformers.

Training Details

The model underwent DPO training for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length used during training was 1024. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Licensing

This model operates under the MIT License, as per the terms of its training data (u-10bei/dpo-dataset-qwen-cot). Users must also adhere to the license terms of the original base model, Qwen/Qwen3-4B-Instruct-2507.