q-hisa/dpo-qwen-cot-merged-v5

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 21, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The q-hisa/dpo-qwen-cot-merged-v5 is a 4 billion parameter Qwen3-based instruction-tuned language model, fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is specifically optimized for improving reasoning through Chain-of-Thought (CoT) and enhancing structured response quality. This model excels in generating aligned and coherent outputs for complex reasoning tasks.

Loading preview...

Overview

This model, q-hisa/dpo-qwen-cot-merged-v5, is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base. It has been meticulously fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library to align its responses with preferred outputs. The training process involved an initial Supervised Fine-Tuning (SFT) phase using a LoRA adapter, followed by DPO to further refine its capabilities.

Key Capabilities

  • Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more structured and logical problem-solving.
  • Improved Structured Responses: Designed to produce higher quality, more coherent, and aligned outputs based on preference datasets.
  • Direct Preference Optimization (DPO): Utilizes DPO to align model behavior with human preferences, leading to more desirable and helpful responses.
  • Full-Merged Weights: The repository provides full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.

Training Details

The model underwent DPO training for 1 epoch with a learning rate of 1e-07 and a beta value of 0.05, using a maximum sequence length of 1024. The training data utilized was u-10bei/dpo-dataset-qwen-cot. The model is released under the MIT License, with users also required to comply with the original base model's license terms.

Good For

  • Applications requiring robust reasoning capabilities.
  • Generating structured and high-quality text responses.
  • Tasks where alignment with preferred outputs is crucial.