SillyWumpus/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 4, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The SillyWumpus/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-based causal language model fine-tuned using Direct Preference Optimization (DPO). Developed by SillyWumpus, it is specifically optimized for improving reasoning capabilities through Chain-of-Thought (CoT) and enhancing structured response quality. This model is designed for tasks requiring aligned, high-quality outputs, particularly in complex reasoning scenarios.

Loading preview...

Model Overview

SillyWumpus/dpo-qwen-cot-merged is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged, eliminating the need for adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized specifically to improve Chain-of-Thought (CoT) reasoning processes.
  • Structured Response Quality: Aligned to produce higher quality and more structured outputs based on preference datasets.
  • DPO Fine-tuning: Leverages DPO to align model responses with preferred human outputs.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It utilized a maximum sequence length of 1024 during training. The training data, u-10bei/dpo-dataset-qwen-cot, was instrumental in shaping its preference-aligned behavior.

Usage Considerations

As a merged model, it can be directly integrated and used with the transformers library. Users should adhere to the MIT License associated with the training data and comply with the original base model's license terms.