takatuki56/dpo-qwen-cot-merged-V1

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 7, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The takatuki56/dpo-qwen-cot-merged-V1 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is specifically optimized to improve reasoning capabilities through Chain-of-Thought (CoT) and enhance structured response quality. This model is suitable for applications requiring aligned, high-quality outputs in reasoning tasks.

Loading preview...

Model Overview

takatuki56/dpo-qwen-cot-merged-V1 is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. This model leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on enhancing reasoning (Chain-of-Thought) and the quality of structured responses.

Key Capabilities

  • Improved Reasoning: Optimized for better Chain-of-Thought capabilities, leading to more logical and structured problem-solving.
  • Enhanced Response Quality: DPO fine-tuning aims to produce higher quality and more aligned outputs based on preference datasets.
  • Direct Usage: Provided as a full-merged 16-bit weights model, it can be used directly with the transformers library without requiring adapter loading.

Training Details

The model was trained for 1 epoch with a learning rate of 5e-07 and a beta value of 0.1. It utilized a maximum sequence length of 4096 tokens and incorporated LoRA configuration (r=32, alpha=64) which has been merged into the base weights. The training data used was [u-10bei/dpo-dataset-qwen-cot].

Good For

  • Applications requiring strong reasoning and structured output generation.
  • Tasks where response alignment and quality are critical.
  • Developers looking for a Qwen3-4B variant with enhanced DPO-driven performance.