yumiyumi/dpo-qwen-cot-merged

Text Generation · 4B parameters · BF16 · 32k context length · Published: Feb 8, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The yumiyumi/dpo-qwen-cot-merged model is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) with the Unsloth library. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality, making it suited to applications that require logical coherence and preference-aligned, well-structured output.


Model Overview

The yumiyumi/dpo-qwen-cot-merged model is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with the fine-tuned weights merged into the base model at 16-bit precision, so no separate adapter loading is required.

Key Capabilities & Optimization

  • Enhanced Reasoning: The model's primary optimization objective was to improve its reasoning abilities, particularly through Chain-of-Thought (CoT) processes.
  • Structured Response Quality: DPO training focused on aligning the model's outputs with preferred responses, leading to better structured and higher-quality generations.
  • Efficient Fine-tuning: Utilized the Unsloth library for DPO, with a single epoch of training and a low learning rate (1e-07), indicating a targeted and efficient optimization process.
  • Direct Usage: As a fully merged model, it can be used directly with the transformers library without additional configuration for LoRA adapters.
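Because the weights are fully merged, the model loads like any other causal LM checkpoint. A minimal sketch using the standard transformers text-generation API; the repo id comes from this card, while the prompt and generation settings are illustrative assumptions:

```python
# Minimal usage sketch, assuming the standard transformers API and the
# chat template inherited from the Qwen3 base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yumiyumi/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A reasoning-style prompt; the model is tuned for step-by-step answers.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

No PEFT or LoRA configuration is needed at inference time, since the DPO updates are already baked into the checkpoint.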

Training Details

The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset, specifically chosen for preference alignment in reasoning tasks. The training employed a maximum sequence length of 1024 tokens and a DPO beta value of 0.1. The base model's license terms apply, and the training data is under an MIT License.
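To make the beta hyperparameter concrete: DPO trains on pairs of chosen and rejected responses, and beta scales how strongly the policy's log-probability margin over the reference model is pushed through a sigmoid loss. A self-contained toy sketch of the per-example DPO loss (the function name and sample log-probabilities are illustrative, not from the training code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy prefers the chosen response over the rejected one,
    relative to the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no margin over the reference, the loss sits at log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)

# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-9.0, -12.0, -10.0, -12.0, beta=0.1)
print(baseline, improved)
```

A small beta like 0.1 keeps the policy close to the reference model, which fits the card's conservative setup (one epoch, 1e-07 learning rate): the goal is to nudge preferences, not to retrain the base model.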

Good For

  • Applications requiring improved logical reasoning and step-by-step thought processes.
  • Scenarios where structured and high-quality responses are critical.
  • Developers looking for a Qwen3-4B variant with enhanced preference alignment for specific output styles.