ShimadaMasatsugu/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 25, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

ShimadaMasatsugu/dpo-qwen-cot-merged is a fine-tuned Qwen3-4B-Instruct-2507 model, optimized using Direct Preference Optimization (DPO) via Unsloth. This model focuses on enhancing reasoning capabilities through Chain-of-Thought (CoT) and improving structured response quality. It is designed for applications requiring precise, aligned outputs, particularly in reasoning tasks.

Loading preview...

Model Overview

ShimadaMasatsugu/dpo-qwen-cot-merged is a specialized language model derived from the Qwen3-4B-Instruct-2507 base model. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, with its 16-bit weights fully merged into the base model, eliminating the need for adapter loading.

Key Capabilities & Optimization

This model's primary optimization objective was to align its responses with preferred outputs, specifically targeting:

  • Improved Reasoning: Enhanced Chain-of-Thought (CoT) capabilities.
  • Structured Response Quality: Better generation of structured outputs based on a preference dataset.

Training Details

The DPO fine-tuning process involved:

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Method: Direct Preference Optimization (DPO)
  • Epochs: 1
  • Learning Rate: 1e-07
  • Max Sequence Length: 1024
  • Training Data: Utilized the u-10bei/dpo-dataset-qwen-cot dataset.

Usage & Licensing

As a merged model, it can be directly used with the transformers library. The model is released under the MIT License, consistent with its training data, and users must also adhere to the original base model's license terms.