mark-22/dpo-qwen-cot-merged

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Feb 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The mark-22/dpo-qwen-cot-merged model is a 4 billion parameter Qwen3-Instruct variant, fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The fine-tuning targets stronger reasoning, particularly Chain-of-Thought (CoT), and higher-quality structured responses. It is designed for applications that need outputs aligned to preference data and coherent behavior on complex reasoning tasks.


Model Overview

The mark-22/dpo-qwen-cot-merged model is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library to align its responses with preferred outputs. The model ships as fully merged 16-bit weights, so no adapter loading is required.
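Because the weights are already merged, the model can be loaded directly with the transformers library. A minimal sketch, assuming standard transformers usage (only the repository id is taken from the card):

```python
# Minimal loading sketch: the LoRA weights are already merged,
# so no PEFT/adapter step is required.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mark-22/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the published BF16 weights
    device_map="auto",           # requires the accelerate package
)
```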

Key Optimizations & Features

  • Enhanced Reasoning (Chain-of-Thought): The primary objective of the DPO fine-tuning was to significantly improve the model's reasoning abilities, especially in Chain-of-Thought (CoT) processes.
  • Improved Structured Responses: Optimization also focused on generating higher quality and more structured outputs, based on the preference dataset used during training.
  • DPO Method: Aligned with Direct Preference Optimization, trained for 1 epoch with a learning rate of 1e-7 and a beta of 0.1 (a configuration sketch follows this list).
  • Merged Weights: The LoRA adapters (r=64, alpha=64) are merged directly into the base model weights, simplifying deployment.
  • Training Sequence Length: DPO fine-tuning used a maximum sequence length of 1024 tokens; the merged model retains the 32k context length listed above.
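The exact training script is not published. The following is a hedged sketch of a comparable DPO run using Unsloth and TRL's DPOTrainer with the hyperparameters listed above; the dataset split, target modules, and output paths are assumptions.

```python
# Hedged sketch of a comparable DPO run (hyperparameters from the card;
# dataset split, target modules, and output paths are assumptions).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,  # training sequence length from the card
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,           # LoRA rank from the card
    lora_alpha=64,  # LoRA alpha from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")  # split assumed

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,            # DPO beta from the card
        learning_rate=1e-7,  # learning rate from the card
        num_train_epochs=1,  # single epoch from the card
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the adapters into the base weights and save as full 16-bit.
model.save_pretrained_merged(
    "dpo-qwen-cot-merged", tokenizer, save_method="merged_16bit"
)
```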

Ideal Use Cases

This model is particularly well-suited for applications where:

  • Complex Reasoning is required, benefiting from its enhanced Chain-of-Thought capabilities (a usage sketch follows this list).
  • Aligned and Coherent Outputs are critical, leveraging its DPO-based preference alignment.
  • Structured Response Generation is a priority, such as in question-answering or data extraction tasks where output format matters.
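Continuing from the loading sketch above, a hedged generation example that prompts for step-by-step reasoning; the prompt wording is illustrative, not taken from the card:

```python
# Illustrative CoT-style prompt; reuses model/tokenizer from the loading sketch.
messages = [
    {
        "role": "user",
        "content": "A train travels 120 km in 1.5 hours. "
                   "What is its average speed? Think step by step.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```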

Users should adhere to the MIT license of the training dataset (u-10bei/dpo-dataset-qwen-cot) and the license terms of the original base model.