keijiban3/dpo-qwen-cot-merged

Text Generation · Concurrency Cost: 1 · Model Size: 0.5B · Quant: BF16 · Ctx Length: 32k · Published: Feb 20, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Warm

The keijiban3/dpo-qwen-cot-merged model is a fine-tuned version of the Qwen3-4B-Instruct-2507 base model, trained with Direct Preference Optimization (DPO) using the Unsloth library. It is optimized to improve reasoning, particularly Chain-of-Thought (CoT) reasoning, and to enhance structured response quality. The model is designed for tasks that require logical coherence and adherence to preferred output formats.


Model Overview

This model, keijiban3/dpo-qwen-cot-merged, is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, listed at approximately 0.5 billion parameters with a context length of 32,768 tokens. It was optimized with Direct Preference Optimization (DPO) via the Unsloth library, and its 16-bit weights are fully merged, so it can be used directly without loading an adapter.
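Because the weights are merged, the checkpoint should load with standard `transformers` calls and no PEFT step. A minimal usage sketch (the prompt is illustrative; the loading pattern is generic Hugging Face usage, not taken from the card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "keijiban3/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 quantization listed above
    device_map="auto",
)

# Build a chat-formatted prompt and generate a step-by-step answer.
messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since no adapter is involved, there is no `peft` dependency at inference time.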

Key Capabilities

  • Enhanced Reasoning: Specifically trained to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring multi-step logical deduction.
  • Structured Response Quality: Optimized to align responses with preferred outputs, leading to more coherent and structured generations.
  • DPO Fine-tuning: Leverages DPO to refine model behavior based on preference datasets, aiming for higher quality and more aligned outputs.
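When consuming CoT-style output downstream, it is common to separate the reasoning trace from the final answer. A small hypothetical helper (not part of the model card; it assumes you prompt the model to end with a `Final answer:` line, which is a prompting convention rather than a guarantee):

```python
def split_cot(response: str) -> tuple[str, str]:
    """Split a chain-of-thought response into (reasoning, final_answer).

    Assumes the prompt asked the model to finish with a line beginning
    'Final answer:'. If the marker is absent, the whole response is
    returned as reasoning and the answer is empty.
    """
    marker = "Final answer:"
    if marker in response:
        reasoning, _, answer = response.rpartition(marker)
        return reasoning.strip(), answer.strip()
    return response.strip(), ""


demo = "Step 1: 12 * 3 = 36.\nStep 2: 36 + 4 = 40.\nFinal answer: 40"
reasoning, answer = split_cot(demo)
```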

Training Details

The model was trained for one epoch of DPO with a learning rate of 1e-7 and a beta of 0.1; the maximum sequence length during training was 1024 tokens. The LoRA adapter (r=8, alpha=16) was subsequently merged into the base model weights.
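The card does not publish the training script, so the following is only a sketch of the stated recipe using TRL and PEFT (the preference dataset name is a placeholder, and argument names follow recent TRL versions):

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen3-4B-Instruct-2507"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA configuration matching the reported r=8, alpha=16.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,    # one epoch, as stated above
    learning_rate=1e-7,
    beta=0.1,              # strength of the preference regularization
    max_length=1024,       # max sequence length used during training
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=load_dataset("your_pref_dataset", split="train"),  # placeholder
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()

# Merge the LoRA weights into the base model and save a standalone
# 16-bit checkpoint, matching the "fully merged" description above.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("dpo-qwen-cot-merged")
```

The card states the training was done with Unsloth rather than plain TRL; Unsloth wraps the same DPO workflow with its own model-loading helpers, so the hyperparameters above carry over.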

Good For

  • Applications requiring improved logical reasoning and step-by-step explanations.
  • Generating structured outputs that adhere to specific formats or preferences.
  • Tasks where response quality and alignment with human preferences are critical.