sei0621/dpo-qwen-cot-merged

Text Generation | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Published: Feb 6, 2026 | License: apache-2.0 | Architecture: Transformer | Open Weights

The sei0621/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve the quality of structured responses, making it suited to tasks that demand logical coherence and adherence to preferred output formats.


Model Overview

This model, sei0621/dpo-qwen-cot-merged, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library to better align its outputs with preferred responses.

Key Capabilities

  • Improved Reasoning: Optimized to enhance Chain-of-Thought (CoT) reasoning, allowing for more logical and structured problem-solving.
  • Enhanced Response Quality: DPO training aligns the model's outputs with preferred responses, leading to higher quality and more structured generations.
  • Direct Use: Provided as a full-merged 16-bit weight model, eliminating the need for adapter loading and simplifying deployment with transformers.
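Since the DPO deltas are already merged into the published 16-bit weights, the model can be loaded directly with `transformers` and no PEFT adapter step. A minimal sketch is below; the prompt and generation settings are illustrative assumptions, not recommendations from the model authors.

```python
# Minimal sketch: load the merged BF16 weights directly with transformers.
# No adapter loading is required because the DPO weights are fully merged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sei0621/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are published in BF16
    device_map="auto",
)

# Example prompt (illustrative); the chat template handles role formatting.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because the weights are BF16, expect roughly 8 GB of accelerator memory for inference at this model size.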

Training Details

The model was trained for 1 epoch of DPO with a learning rate of 5e-7 and a DPO beta of 0.1, using a maximum sequence length of 1024 tokens on the u-10bei/dpo-dataset-qwen-cot preference dataset. The base model supports a context length of 32,768 tokens.
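For context, the beta of 0.1 scales the implicit reward margin in the standard DPO objective (Rafailov et al.): the loss is `-log sigmoid(beta * ((logp_w - logp_w_ref) - (logp_l - logp_l_ref)))`, where `w` is the chosen and `l` the rejected response. A minimal per-example sketch in plain Python, with made-up illustrative log-probabilities:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# Made-up sequence log-probabilities for illustration: the policy has shifted
# probability mass toward the chosen response relative to the reference model.
loss = dpo_loss(-10.0, -14.0, ref_chosen_logp=-11.0, ref_rejected_logp=-12.0, beta=0.1)
```

A smaller beta (such as the 0.1 used here) keeps the implicit rewards small, which tolerates larger divergence from the reference policy before the loss saturates.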

Usage Considerations

This model is suitable for applications where improved reasoning, structured output, and alignment with specific response preferences are critical. Users should note that, per the training dataset's terms, the model follows the MIT License, and that compliance with the original base model's license is also required.