stemask2985/dpo-qwen-cot-merged

Text generation · 4B parameters · BF16 · 32k context · Published: Feb 5, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

The stemask2985/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based language model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth to improve Chain-of-Thought (CoT) reasoning and the quality of structured responses. It is intended for applications that require logical coherence and adherence to preferred output formats.


Model Overview

This model, stemask2985/dpo-qwen-cot-merged, is a 4 billion parameter variant of the Qwen3-4B-Instruct-2507 base model. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting enhanced reasoning and structured output generation.

Key Capabilities & Features

  • Improved Reasoning (Chain-of-Thought): Optimized to produce more coherent and logical reasoning steps in its responses.
  • Enhanced Structured Output: Fine-tuned to align responses with preferred formats, improving the quality of structured data generation.
  • DPO Fine-tuning: Utilizes Direct Preference Optimization for better alignment with human preferences.
  • Full-Merged Weights: Provided as a 16-bit merged model, eliminating the need for adapter loading during deployment.
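Because the weights are fully merged, the model should load through the standard `transformers` APIs with no adapter step. The snippet below is a minimal sketch of that usage, not an officially documented example from the model card: the `build_messages` helper, the system prompt, and the generation parameters are illustrative assumptions.

```python
def build_messages(prompt: str) -> list[dict]:
    """Build a chat-format message list. The system prompt is an
    illustrative assumption, not part of the model card."""
    return [
        {"role": "system", "content": "Reason step by step before answering."},
        {"role": "user", "content": prompt},
    ]


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the merged 16-bit model and generate a response.
    torch/transformers are imported lazily so the sketch stays
    importable without them installed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "stemask2985/dpo-qwen-cot-merged"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Apply the tokenizer's chat template, then decode only the newly
    # generated tokens (everything after the prompt).
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Running `generate()` downloads the full 16-bit weights (~8 GB), so a BF16-capable GPU is assumed.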

Training Details

The model was trained for one epoch with a learning rate of 1e-7, a DPO beta of 0.1, and a maximum sequence length of 1024 tokens. Training used the u-10bei/dpo-dataset-qwen-cot preference dataset, which focuses on Chain-of-Thought examples.
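The stated hyperparameters map directly onto the parameter names in TRL's `DPOConfig`/`DPOTrainer`, which Unsloth wraps. The sketch below is an assumed reconstruction of such a run, not the author's actual training script; the dataset column layout (prompt/chosen/rejected) is TRL's standard DPO format and is assumed here.

```python
# Hyperparameters stated in the model card, keyed by their DPOConfig names.
HYPERPARAMS = {
    "num_train_epochs": 1,
    "learning_rate": 1e-7,
    "beta": 0.1,         # DPO preference-strength coefficient
    "max_length": 1024,  # maximum sequence length
}


def train():
    """Assumed reconstruction of the DPO run with TRL. trl/datasets are
    imported lazily so the sketch stays importable without them."""
    from datasets import load_dataset
    from trl import DPOConfig, DPOTrainer

    dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")
    config = DPOConfig(output_dir="dpo-qwen-cot-merged", **HYPERPARAMS)
    trainer = DPOTrainer(
        model="Qwen/Qwen3-4B-Instruct-2507",  # base model named in the card
        args=config,
        train_dataset=dataset,  # expects prompt/chosen/rejected columns
    )
    trainer.train()
```

Calling `train()` requires a GPU and downloads both the base model and the dataset; the low learning rate (1e-7) keeps the DPO update gentle relative to the instruct base.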

Ideal Use Cases

  • Applications requiring robust reasoning abilities.
  • Scenarios where structured and high-quality responses are critical.
  • Tasks benefiting from preference-aligned language generation.