rk611/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 18, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The rk611/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based causal language model, fine-tuned using Direct Preference Optimization (DPO) via Unsloth. It is specifically optimized for improving reasoning capabilities through Chain-of-Thought (CoT) and generating structured responses. This model is designed for tasks requiring enhanced logical progression and aligned output quality.

Loading preview...

Model Overview

The rk611/dpo-qwen-cot-merged model is a 4 billion parameter language model based on the Qwen/Qwen3-4B-Instruct-2507 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model, eliminating the need for adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, enabling more logical and structured problem-solving.
  • Aligned Responses: DPO training aligns the model's outputs with preferred responses, leading to higher quality and more relevant generations.
  • Structured Output: Focuses on generating well-structured and coherent responses based on the preference dataset used during training.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length used during training was 1024 tokens. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset, and the model is released under the MIT License, adhering to the original base model's license terms.

Good For

  • Applications requiring improved logical reasoning and step-by-step thought processes.
  • Use cases where response quality and alignment with specific preferences are critical.
  • Generating structured and coherent text outputs.