Tamata1208/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 16, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Tamata1208/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is specifically optimized for improving reasoning capabilities, particularly Chain-of-Thought (CoT), and enhancing the quality of structured responses. This model is designed for applications requiring aligned and coherent outputs in complex reasoning tasks.

Loading preview...

Model Overview

Tamata1208/dpo-qwen-cot-merged is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base model. It has been further fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library to enhance its performance.

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): The model is specifically optimized to improve its ability to generate detailed, step-by-step reasoning processes, making it suitable for complex problem-solving.
  • Improved Structured Responses: Through DPO training, the model aligns its outputs with preferred formats, leading to higher quality and more consistent structured responses.
  • Direct Use: This repository provides the full-merged 16-bit weights, meaning no adapter loading is required for deployment, simplifying integration into existing workflows.

Training Details

The model underwent 2 epochs of DPO training with a learning rate of 5e-06 and a beta value of 0.1. It utilized a maximum sequence length of 2048 tokens and incorporated LoRA (r=8, alpha=16) which has been merged into the base model. The training data used for DPO was [u-10bei/dpo-dataset-qwen-cot].

Licensing

This model is released under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the license terms of the original Qwen base model.