tabidance/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The tabidance/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-4B-Instruct-2507 variant, fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is specifically optimized to improve reasoning capabilities through Chain-of-Thought and enhance structured response quality. This model excels in generating aligned and coherent outputs based on preferred data, making it suitable for tasks requiring precise and structured language generation.

Loading preview...

Model Overview

The tabidance/dpo-qwen-cot-merged model is a 4 billion parameter language model based on the Qwen/Qwen3-4B-Instruct-2507 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library, resulting in a merged 16-bit weight model that requires no adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, allowing for more structured and logical response generation.
  • Improved Structured Output: Specifically trained to align responses with preferred outputs, enhancing the quality of structured data generation.
  • DPO Fine-tuning: Utilizes DPO to align model behavior with human preferences, leading to more desirable and coherent outputs.
  • Direct Usage: As a fully merged model, it can be used directly with the transformers library without additional configuration.

Training Details

The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1, using a maximum sequence length of 1024. The training data utilized was u-10bei/dpo-dataset-qwen-cot. The model's license is MIT, consistent with the dataset terms, and users must also adhere to the original base model's license.