takeshi200ok/dpo-qwen-cot-merged

Text Generation | Model Size: 4B | Quant: BF16 | Context Length: 32k | Published: Feb 3, 2026 | License: apache-2.0 | Architecture: Transformer

The takeshi200ok/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based causal language model developed by takeshi200ok. It has been fine-tuned using Direct Preference Optimization (DPO) to enhance reasoning capabilities, specifically Chain-of-Thought (CoT), and improve structured response quality. This fully merged 16-bit model is optimized for generating aligned and coherent outputs in reasoning-intensive tasks.


Overview

This model, dpo-qwen-cot-merged, is a 4 billion parameter language model based on the Qwen3 architecture, specifically fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Developed by takeshi200ok, it leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring multi-step logical deduction.
  • Improved Structured Responses: DPO training focuses on generating higher quality and more structured outputs.
  • Fully Merged Model: Shipped as a single 16-bit (BF16) checkpoint, so no separate adapter loading is required; it can be used directly with standard tooling (see the loading sketch below).
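
Because the weights are fully merged, the model loads with the standard transformers API. The sketch below is illustrative: the repo id and BF16 dtype come from the card, but the prompt, generation settings, and device placement are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "takeshi200ok/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # card lists BF16 weights
    device_map="auto",           # assumption: place layers automatically
)

# Qwen3 instruct models use a chat template; an illustrative CoT-style prompt.
messages = [{"role": "user", "content": "Explain step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```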

Training Details

The model underwent one epoch of DPO training with a learning rate of 1e-5 and a beta of 0.1, using a maximum sequence length of 3072 tokens. Training data was sourced from u-10bei/dpo-dataset-qwen-cot. A hedged reconstruction of this recipe follows.
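
The hyperparameters above roughly correspond to a standard TRL DPO run. The sketch below is an assumption-laden reconstruction, not the author's script: the card says training was done via Unsloth, whereas this version uses plain TRL, and `output_dir` plus any hyperparameter not named on the card are hypothetical.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "Qwen/Qwen3-4B-Instruct-2507"  # base model named on the card
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference dataset named on the card; split is an assumption.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

config = DPOConfig(
    output_dir="dpo-qwen-cot",  # hypothetical
    beta=0.1,                   # from the card
    learning_rate=1e-5,         # from the card
    num_train_epochs=1,         # from the card
    max_length=3072,            # from the card
)
trainer = DPOTrainer(
    model=model,                # TRL builds the frozen reference model itself
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

After training, the LoRA-free merged checkpoint distributed here would correspond to saving the resulting full-precision weights directly (e.g. `trainer.save_model(...)`), which is what makes adapter loading unnecessary at inference time.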

Good For

  • Applications requiring robust reasoning and logical inference.
  • Generating structured and coherent text outputs.
  • Use cases where response alignment and quality are critical.