hifill/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 7, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The hifill/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based instruction-tuned language model, fine-tuned using Direct Preference Optimization (DPO) by hifill. It is specifically optimized for improving reasoning capabilities through Chain-of-Thought (CoT) and generating structured responses. This model is designed for tasks requiring enhanced logical deduction and coherent, well-organized output.

Loading preview...

Model Overview

The hifill/dpo-qwen-cot-merged model is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged for direct use without adapters.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
  • Structured Response Generation: Aligned to produce higher quality, more structured outputs based on preferred response patterns.
  • DPO Fine-tuning: Leverages DPO to align model behavior with human preferences, focusing on specific response characteristics.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta of 0.1. It utilized a maximum sequence length of 1024 and incorporated LoRA configuration (r=8, alpha=16) which was subsequently merged into the base model. The training data used for DPO was u-10bei/dpo-dataset-qwen-cot.

When to Use This Model

This model is particularly well-suited for applications where improved reasoning, logical coherence, and structured output are critical. It can be beneficial for tasks requiring detailed explanations, step-by-step problem-solving, or generating responses that adhere to specific formats. Users should be aware that the model's license follows the MIT License, as per the dataset terms, and compliance with the original base model's license is also required.