KSIMNB/dpo-qwen-cot-merged

Text Generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 28, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

KSIMNB/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with Direct Preference Optimization (DPO) using the Unsloth library. The fine-tuning targets stronger reasoning, particularly Chain-of-Thought (CoT), and higher-quality structured responses. It is intended for applications that require coherent text generation aligned with preferred outputs.


Model Overview

KSIMNB/dpo-qwen-cot-merged is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO), using the Unsloth library, to align its outputs with preferred responses. The repository ships fully merged 16-bit (BF16) weights, so no LoRA adapter loading is required.

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): Optimized through DPO to improve the model's ability to generate logical and step-by-step reasoning processes.
  • Improved Structured Responses: Focuses on producing higher quality and more aligned structured outputs based on preference data.
  • Direct Use: As a fully merged model, it can be directly integrated and used with the transformers library without additional configuration for LoRA adapters.
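Because the weights are fully merged, the model can be loaded like any standard causal LM. The snippet below is an illustrative sketch, not an official usage example from the model authors; it assumes the Qwen3 chat template shipped with the tokenizer, and the `build_messages` helper and the sample prompt are our own.

```python
def build_messages(prompt: str) -> list[dict]:
    # Standard chat format consumed by the tokenizer's chat template.
    return [{"role": "user", "content": prompt}]


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    # transformers is imported lazily so the helper above stays importable
    # without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "KSIMNB/dpo-qwen-cot-merged"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The merged checkpoint is BF16; "auto" picks that up directly.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    input_ids = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    )


if __name__ == "__main__":
    print(generate("Explain step by step why the sum of two odd numbers is even."))
```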

Training Details

The model underwent a single epoch of DPO training with a learning rate of 1e-7, a beta of 0.1, and a maximum sequence length of 1024 tokens. The preference dataset was u-10bei/dpo-dataset-qwen-cot.
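To make the role of beta concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. This is an illustration of the loss formula, not the actual Unsloth training code; the function name and example log-probabilities are ours.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or rejected
    response under the policy being trained or the frozen reference model.
    """
    # Log-ratios measure how much the policy has moved away from the
    # reference model on each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales the implicit reward margin; a small beta (0.1 here)
    # keeps the policy close to the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)): low when the policy clearly prefers the
    # chosen response, high when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference model, so all log-ratios are zero and the loss is log 2; as the policy learns to favor chosen responses, the margin grows and the loss falls.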

When to Use This Model

This model is particularly suitable for use cases where reasoning quality and alignment to preferred response styles are critical. It is ideal for applications requiring coherent, structured, and logically sound text generation, especially in scenarios benefiting from Chain-of-Thought capabilities.