yuzkawash/dpo-qwen-cot-merged

Text Generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 8, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The yuzkawash/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based causal language model, fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is specifically optimized to enhance reasoning capabilities through Chain-of-Thought (CoT) and improve structured response quality. This model excels in tasks requiring logical progression and coherent, well-formed outputs, making it suitable for complex analytical prompts.


Overview

This model, yuzkawash/dpo-qwen-cot-merged, is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improvements in reasoning and structured response generation. The weights are fully merged at 16-bit precision, so no separate LoRA adapter needs to be loaded.

Key Capabilities

  • Enhanced Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
  • Improved Structured Responses: Fine-tuned to produce higher quality, more coherent, and well-structured outputs based on preferred examples.
  • Direct Use: As a fully merged model, it can be used directly with the transformers library without additional configuration.
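Because the weights are fully merged, loading follows the standard transformers pattern. The sketch below is illustrative: the generation settings and the chat-template call are assumptions, not taken from the card.

```python
MODEL_ID = "yuzkawash/dpo-qwen-cot-merged"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a completion from the merged model (no adapter-loading step needed)."""
    # Import inside the function so the sketch can be read without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="bfloat16", device_map="auto"
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
```

A call such as `generate("Work through this step by step: is 221 prime?")` then returns the model's step-by-step answer as a string.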

Training Details

The model was trained for 1.5 epochs with a learning rate of 2e-06 and a DPO beta of 0.2, using a maximum sequence length of 1024. Training applied a LoRA configuration (r=8, alpha=16) whose adapters were subsequently merged into the base model. The preference data for DPO was sourced from the u-10bei/dpo-dataset-qwen-cot dataset.
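The hyperparameters above can be wired into a training script roughly as follows. This is a hypothetical reconstruction: the dataset split, output paths, and trainer wiring are assumptions, and TRL's `DPOTrainer` stands in for the Unsloth-specific setup the card describes.

```python
# Hyperparameters stated in the card.
HPARAMS = {
    "learning_rate": 2e-6,    # from the card
    "beta": 0.2,              # DPO preference temperature (from the card)
    "num_train_epochs": 1.5,  # from the card
    "max_length": 1024,       # maximum sequence length (from the card)
}

def train():
    # Heavy imports are kept inside the function so the sketch can be read
    # (and the hyperparameters inspected) without these libraries installed.
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    base = "Qwen/Qwen3-4B-Instruct-2507"
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")  # from the card
    args = DPOConfig(output_dir="dpo-qwen-cot", **HPARAMS)
    dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")  # assumed split

    trainer = DPOTrainer(
        model=model,
        args=args,
        train_dataset=dataset,
        processing_class=tokenizer,
        peft_config=peft_config,
    )
    trainer.train()
    # Merge the LoRA adapters back into the base weights, as in the released model.
    merged = trainer.model.merge_and_unload()
    merged.save_pretrained("dpo-qwen-cot-merged")
```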

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Complex problem-solving where step-by-step reasoning is crucial.
  • Generating structured data or responses that adhere to specific formats.
  • Tasks benefiting from improved coherence and logical flow in generated text.
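For the first use case, step-by-step reasoning can be elicited with an explicit instruction in the prompt. The wording below is illustrative, not a template prescribed by the model:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question in an explicit step-by-step instruction (illustrative wording)."""
    return (
        "Answer the following question. Think through the problem step by step, "
        "numbering each step, then state the final answer on its own line.\n\n"
        f"Question: {question}"
    )

prompt = cot_prompt("A train travels 120 km in 1.5 hours. What is its average speed?")
```

The resulting string can be passed as the user message when generating with the model.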