nisiwaki/dpo-qwen-cot-merged_01

Text Generation · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Feb 7, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The nisiwaki/dpo-qwen-cot-merged_01 model is a 4-billion-parameter variant of Qwen3-4B-Instruct-2507, fine-tuned by nisiwaki with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and improve structured response quality. The model supports a 40,960-token context length and is designed for direct use with transformers, requiring no adapter loading.


Overview

This model, nisiwaki/dpo-qwen-cot-merged_01, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, with the fine-tuned weights merged into the base model in 16-bit precision for direct deployment.

Key Capabilities

  • Enhanced Reasoning: Specifically optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring logical progression and structured thinking.
  • Improved Response Quality: DPO fine-tuning aligns the model's outputs with preferred responses, leading to higher quality and more aligned generations.
  • Direct Use: As a fully merged model, it can be used directly with the transformers library without the need for separate adapter loading.
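Since the adapter is already merged, loading follows the standard transformers pattern. The sketch below is illustrative (the prompt and generation settings are assumptions, not values from this card):

```python
# Minimal direct-use sketch with transformers; no PEFT/adapter loading needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisiwaki/dpo-qwen-cot-merged_01"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # picks up the BF16 checkpoint dtype
    device_map="auto",    # requires accelerate; drop to load on CPU
)

# Example prompt; the chat template comes from the tokenizer config.
messages = [{"role": "user", "content": "Explain step by step why 17 is prime."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```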

Training Details

The model was trained with an SFT + DPO approach over 3 epochs per stage, with a learning rate of 1e-5 and a DPO beta of 0.1. The maximum sequence length during training was 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot dataset, and the model is released under an MIT License; users must also comply with the original base model's license terms.
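To make the role of the beta = 0.1 hyperparameter concrete, the standard DPO loss for one preference pair can be sketched in pure Python. The function name and scalar sequence log-probabilities below are illustrative assumptions, not part of this model's training code:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are total log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    # Implicit reward of each response, measured relative to the reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Beta scales how strongly the policy is pushed away from the reference
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written stably as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))

# At initialization (policy == reference) the loss is log(2) ≈ 0.693;
# it falls below that once the policy prefers the chosen response.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

A small beta like 0.1 keeps the KL penalty against the reference model loose, letting preferences shift the policy noticeably while staying anchored to the base model's behavior.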