sokosokobe/dpo-qwen-cot-merged
Text generation · Concurrency cost: 1 · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 3, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The sokosokobe/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based causal language model fine-tuned with Direct Preference Optimization (DPO). It is optimized for stronger Chain-of-Thought (CoT) reasoning and higher-quality structured responses, making it suited to tasks that require clear logical progression and coherent, well-organized outputs.


Model Overview

The sokosokobe/dpo-qwen-cot-merged model is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base architecture. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full 16-bit weights merged for direct use without adapters.

Key Optimizations

This model's primary optimization focuses on enhancing:

  • Reasoning (Chain-of-Thought): Improved ability to generate logical, step-by-step reasoning processes.
  • Structured Response Quality: Better coherence and organization in generated outputs, aligning with preferred response formats.

Training Details

The DPO fine-tuning process involved:

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Method: Direct Preference Optimization (DPO)
  • Epochs: 1
  • Learning Rate: 1e-07
  • Max Sequence Length: 1024
  • Training Data: the u-10bei/dpo-dataset-qwen-cot dataset of preference pairs, used for preference alignment.
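The setup above can be sketched in code. This is a hedged reconstruction, not the author's actual training script: the base model, dataset name, epochs, learning rate, and max sequence length come from the card, while `beta` (the DPO KL-penalty strength) and the trainer wiring shown in comments are illustrative assumptions based on TRL's `DPOTrainer`, which Unsloth wraps.

```python
# Hypothetical reconstruction of the DPO run described in this card.
# Values marked "from the card" are stated above; the rest are assumptions.
hyperparams = {
    "model_name": "Qwen/Qwen3-4B-Instruct-2507",   # from the card
    "dataset": "u-10bei/dpo-dataset-qwen-cot",     # from the card
    "num_train_epochs": 1,                          # from the card
    "learning_rate": 1e-7,                          # from the card
    "max_length": 1024,                             # from the card
    "beta": 0.1,                                    # assumed (common DPO default)
}

def dpo_config_kwargs(hp):
    """Map the card's hyperparameters onto keyword arguments in the shape
    expected by a DPO trainer config (e.g. trl.DPOConfig)."""
    return {
        "num_train_epochs": hp["num_train_epochs"],
        "learning_rate": hp["learning_rate"],
        "max_length": hp["max_length"],
        "beta": hp["beta"],
    }

# An actual run (requires `trl`, `datasets`, and a GPU) would look roughly like:
#   from trl import DPOConfig, DPOTrainer
#   from datasets import load_dataset
#   config = DPOConfig(output_dir="dpo-qwen-cot", **dpo_config_kwargs(hyperparams))
#   trainer = DPOTrainer(model=hyperparams["model_name"], args=config,
#                        train_dataset=load_dataset(hyperparams["dataset"], split="train"))
#   trainer.train()
```

After training, the LoRA adapters (if any) would be merged and the full 16-bit weights saved, which is what "merged" in the repo name refers to.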

Usage

As a merged model, it can be loaded directly with the transformers library for inference; no adapter weights are required. The model is released under the Apache 2.0 license, and the original base model's license terms also apply.
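A minimal inference sketch follows. The repo id comes from the card; the system prompt, example question, and generation settings are illustrative assumptions, and the chat-template call uses the standard transformers API for instruct models.

```python
# Minimal inference sketch for sokosokobe/dpo-qwen-cot-merged with transformers.
from typing import Dict, List

MODEL_ID = "sokosokobe/dpo-qwen-cot-merged"

def build_messages(question: str) -> List[Dict[str, str]]:
    """Chat-format messages nudging the model toward step-by-step (CoT) output."""
    return [
        {"role": "system", "content": "Reason step by step before giving a final answer."},
        {"role": "user", "content": question},
    ]

def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the merged BF16 weights and generate a response.
    Requires `transformers` and `torch`; downloads ~8 GB of weights."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

For example, `generate_answer("If a train travels 60 km in 40 minutes, what is its speed in km/h?")` should return a response that works through the unit conversion before stating the final speed.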