KotaroT1/dpo-qwen-cot-merged

Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 5, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

KotaroT1/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model, fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The fine-tuning targets stronger reasoning, particularly Chain-of-Thought (CoT), and higher-quality structured responses. It is intended for tasks that require coherent outputs aligned with a preference dataset.


Model Overview

This model, dpo-qwen-cot-merged, is a 4 billion parameter variant of the Qwen3 architecture, specifically fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): Optimized to improve the model's ability to generate step-by-step reasoning processes.
  • Improved Structured Responses: Focuses on producing higher quality and more coherent structured outputs.
  • DPO Fine-tuning: Utilizes DPO with a preference dataset to guide response generation towards desired characteristics.
  • Merged Weights: Contains full 16-bit merged weights, eliminating the need for adapter loading and simplifying deployment with transformers.
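
Because the 16-bit weights are already merged, the model can be loaded like any other checkpoint with the standard `transformers` API, with no PEFT adapter step. A minimal sketch (the function wrapper and defaults are illustrative, not from the model card):

```python
def load_model(model_id: str = "KotaroT1/dpo-qwen-cot-merged"):
    """Load the merged 16-bit checkpoint directly with transformers.

    No adapter loading is required because the LoRA weights
    are already merged into the base model.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the card lists BF16 weights
        device_map="auto",
    )
    return model, tokenizer
```

The imports live inside the function so nothing is downloaded until the loader is actually called.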

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length during training was 1024. The LoRA adapters (r=8, alpha=16) were merged into the base model after training. The DPO preference data was sourced from u-10bei/dpo-dataset-qwen-cot.
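
For reference, the beta value above enters the standard DPO objective, −log σ(β · Δ), where Δ is the difference between the policy and reference log-ratios for the chosen and rejected completions. A minimal pure-Python sketch (variable names are illustrative, not taken from the training code):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * delta).

    delta = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected),
    i.e. how much more the policy prefers the chosen completion
    over the rejected one, relative to the reference model.
    """
    delta = ((policy_chosen_logp - ref_chosen_logp)
             - (policy_rejected_logp - ref_rejected_logp))
    logits = beta * delta
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

A small beta such as 0.1 flattens the sigmoid, so the policy is only gently pushed away from the reference model, which helps preserve the base model's capabilities during preference tuning.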

Licensing

This model is released under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the license terms of the original base model, Qwen3-4B-Instruct-2507.