KawausoHiroKawauso/dpo-qwen-cot-merged
  • Task: Text generation
  • Model Size: 4B
  • Quantization: BF16
  • Context Length: 32k
  • Published: Feb 8, 2026
  • License: apache-2.0
  • Architecture: Transformer (open weights)

KawausoHiroKawauso/dpo-qwen-cot-merged is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). The fine-tuning targets stronger reasoning, particularly Chain-of-Thought (CoT), and higher-quality structured responses. A preference dataset is used to align the model's outputs with desired formats and logical flows, making it suitable for tasks that require structured, well-reasoned answers.


Model Overview

This model, KawausoHiroKawauso/dpo-qwen-cot-merged, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model, eliminating the need for adapter loading.
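Because the DPO-trained weights are fully merged, the model should load like any standard checkpoint via the Hugging Face `transformers` API, with no PEFT adapter step. A minimal sketch (the example question and generation settings are illustrative, not from the model card):

```python
MODEL_ID = "KawausoHiroKawauso/dpo-qwen-cot-merged"

def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format expected by Qwen instruct models."""
    return [{"role": "user", "content": question}]

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # No adapter loading needed: the 16-bit DPO weights are merged into the base model.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")

    prompt = tokenizer.apply_chat_template(
        build_messages("If a train travels 120 km in 1.5 hours, what is its average speed?"),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```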

Key Optimizations

The primary objective of this DPO fine-tuning was to enhance the model's ability to generate improved reasoning (Chain-of-Thought) and produce high-quality structured responses. This was achieved by aligning the model's outputs with a specific preference dataset (u-10bei/dpo-dataset-qwen-cot).

Training Configuration

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Method: Direct Preference Optimization (DPO)
  • Epochs: 1
  • Learning Rate: 1e-05
  • Max Sequence Length: 1024
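The objective these settings drive can be sketched as the per-example DPO loss: the model is pushed to widen its chosen-vs-rejected log-probability margin relative to the frozen reference (base) model. This is a simplified scalar version; libraries such as TRL compute it over batched, token-level log-probabilities, and the `beta` value here is an illustrative default, not confirmed from this training run:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When the policy's margin matches the reference's, the loss sits at log 2; as the policy prefers the chosen response more strongly than the reference does, the loss falls toward zero.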

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Enhanced Reasoning: Tasks that benefit from explicit, step-by-step logical deductions.
  • Structured Output Generation: Scenarios where responses need to adhere to specific formats or structures.
  • Preference Alignment: Use cases where model outputs should closely match human-preferred examples for quality and coherence.