hiro7ka/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 27, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The hiro7ka/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based instruction-tuned causal language model, fine-tuned by hiro7ka using Direct Preference Optimization (DPO) via Unsloth. Optimized for improved reasoning (Chain-of-Thought) and structured response quality, it leverages a 32768-token context length. This model is designed to provide aligned and coherent outputs, making it suitable for tasks requiring robust logical progression and well-structured answers.

Loading preview...

Overview

This model, hiro7ka/dpo-qwen-cot-merged, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to enhance its response quality and alignment. The model incorporates full-merged 16-bit weights, eliminating the need for adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning abilities.
  • Structured Response Quality: Fine-tuned to generate more coherent and well-structured outputs.
  • DPO Alignment: Utilizes Direct Preference Optimization to align model responses with preferred human outputs.
  • Efficient Deployment: Provided as a merged model, allowing direct use with the transformers library without additional LoRA adapter loading.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 2e-07 and a beta value of 0.08. It was trained with a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot dataset. The base model's license terms (MIT License as per the dataset) apply.

Use Cases

This model is particularly well-suited for applications requiring:

  • Improved logical reasoning in responses.
  • Generation of structured and high-quality text.
  • Tasks where alignment with preferred outputs is critical.