Umezaki/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 4, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Umezaki/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based instruction-tuned causal language model, fine-tuned using Direct Preference Optimization (DPO) via Unsloth. This model is specifically optimized for improving reasoning capabilities, particularly Chain-of-Thought (CoT), and generating structured responses. It is designed for tasks requiring enhanced logical progression and aligned output quality.

Loading preview...

Model Overview

Umezaki/dpo-qwen-cot-merged is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library, to enhance its response quality and alignment.

Key Capabilities

  • Improved Reasoning (Chain-of-Thought): The model's primary optimization target was to enhance its ability to generate logical, step-by-step reasoning processes.
  • Structured Response Quality: DPO training focused on aligning the model's outputs with preferred formats and structures, based on a specific preference dataset.
  • Full-Merged Weights: This repository provides the full 16-bit merged weights, eliminating the need for adapter loading and simplifying deployment.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It utilized a maximum sequence length of 1024 during training. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Good For

  • Applications requiring enhanced logical reasoning and problem-solving.
  • Generating structured outputs that adhere to specific formats.
  • Tasks where response alignment and quality are critical.

Usage

As a merged model, it can be directly loaded and used with the transformers library for inference. Users should be aware that the model's license (MIT) is derived from its training data, and compliance with the original base model's license terms is also required.