rokugatsu/dpo-qwen-cot-merged

Text Generation · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 4, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The rokugatsu/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based causal language model, fine-tuned by rokugatsu with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and the quality of structured responses, making it suited to applications that require logical coherence and well-formed outputs.


Model Overview

The model is built on the Qwen/Qwen3-4B-Instruct-2507 base and fine-tuned by rokugatsu using Direct Preference Optimization (DPO), leveraging the Unsloth library to align its responses with preferred outputs. This repository provides fully merged 16-bit (BF16) weights, so no separate adapter loading is required.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent outputs.
  • Improved Structured Responses: Focuses on enhancing the quality of structured responses based on a preference dataset.
  • Direct Preference Optimization (DPO): Utilizes DPO for alignment, aiming for better human preference adherence.
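To make the DPO alignment step above concrete, the following is an illustrative, standalone implementation of the standard per-example DPO loss. It is not code from this repository; the β value and log-probabilities in the comments are hypothetical:

```python
import math


def dpo_loss(beta: float,
             logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is the policy log-prob minus the reference (frozen base) log-prob
    of the same completion. Lower loss = policy prefers the chosen response more
    strongly than the reference does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Hypothetical sequence log-probabilities: the policy favors the chosen response.
loss = dpo_loss(beta=0.1,
                logp_chosen=-1.0, logp_rejected=-2.0,
                ref_logp_chosen=-1.5, ref_logp_rejected=-1.5)
```

Increasing the policy's preference for the chosen response over the rejected one drives this loss toward zero, which is the mechanism DPO uses to align the model with the preference dataset.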

Training Details

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Methodology: Direct Preference Optimization (DPO)
  • Training Data: u-10bei/dpo-dataset-qwen-cot
  • Configuration: Trained for 1 epoch with a learning rate of 1e-07 and a max sequence length of 1024.
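The reported hyperparameters can be collected into a single configuration dict, as one might pass to a DPO training loop. The field names below follow TRL's DPOConfig conventions, which is an assumption; the exact training script is not published on the card:

```python
# Hyperparameters stated on the model card, expressed as a plain dict.
# Field names mirror TRL's DPOConfig (assumed; not confirmed by the card).
dpo_hparams = {
    "num_train_epochs": 1,       # trained for 1 epoch
    "learning_rate": 1e-7,       # reported learning rate
    "max_length": 1024,          # max sequence length during training
}
```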

Usage Considerations

Because the weights are fully merged, this model can be loaded directly with the transformers library; no adapter files are needed. Users must comply with the MIT license of the training dataset and the license terms of the original base model.
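Direct use with transformers can be sketched as below. This is a minimal example assuming the standard AutoTokenizer/AutoModelForCausalLM APIs and the tokenizer's built-in chat template; it is not an official snippet from the repository:

```python
MODEL_ID = "rokugatsu/dpo-qwen-cot-merged"


def generate_reply(messages, max_new_tokens=512):
    """Load the merged model and generate a chat completion.

    Imports are local so the sketch is cheap to define; model download and
    loading happen on the first call.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # The tokenizer's chat template formats the conversation for Qwen3.
    prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:],
                      skip_special_tokens=True)
```

A typical call would be `generate_reply([{"role": "user", "content": "Explain step by step why the sky is blue."}])`, where the step-by-step phrasing plays to the model's CoT tuning.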