mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024

Text generation · 4B parameters · BF16 · 32k context length · Published: Feb 5, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024 is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought (CoT) reasoning and to improve the quality of structured responses, making it suited to applications that require logical coherence and well-formed outputs.


Model Overview

This model, mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, and the result is a fully merged 16-bit (BF16) checkpoint that requires no adapter loading.
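Because the weights are fully merged, the checkpoint can be consumed like any standard causal LM. A minimal inference sketch; the model id comes from this card, while the loading code is ordinary Hugging Face `transformers` usage and is an assumption about how the checkpoint is meant to be used, not part of the card itself:

```python
# Minimal inference sketch for the merged DPO checkpoint.
# No PEFT/LoRA adapter is needed: the BF16 weights are already merged.

MODEL_ID = "mutsumutsu/dpo-qwen-cot-merged-260205-tokenchg2024-1024"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and merged BF16 weights."""
    # Imported lazily so the sketch stays importable without transformers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the card lists BF16 weights
        device_map="auto",
    )
    return tokenizer, model


def chat(tokenizer, model, messages, max_new_tokens=512):
    """Run one chat turn using the tokenizer's built-in chat template."""
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Usage would be `tokenizer, model = load_model()` followed by `chat(tokenizer, model, [{"role": "user", "content": "..."}])`.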

Key Optimizations

The primary objective of this fine-tuning was to align the model's responses with preferred outputs, with a specific focus on:

  • Enhanced Reasoning: Improving Chain-of-Thought (CoT) capabilities.
  • Structured Response Quality: Generating more coherent and well-formed outputs based on a preference dataset.

Training Configuration

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Method: DPO (Direct Preference Optimization)
  • Epochs: 1
  • Learning Rate: 1e-07
  • Max Sequence Length: 2048
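For reference, DPO (Rafailov et al., 2023) trains the policy directly on preference pairs, without a separate reward model. Given a prompt $x$ with a preferred response $y_w$ and a rejected response $y_l$, the objective is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen base model (here Qwen/Qwen3-4B-Instruct-2507), $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference. The card does not state the $\beta$ value used for this run.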

Intended Use Cases

This model is particularly well-suited for applications where:

  • Logical Reasoning is critical, benefiting from its CoT optimization.
  • High-Quality, Structured Outputs are required, such as in question-answering, summarization, or content generation tasks demanding clear organization.
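Since the model is tuned to emit its reasoning before a conclusion, downstream code often needs to separate the chain of thought from the final answer. A minimal sketch; the `"Answer:"` delimiter is an assumed prompting convention, not something the checkpoint guarantees:

```python
def split_cot_response(text: str, marker: str = "Answer:") -> tuple[str, str]:
    """Split a CoT-style completion into (reasoning, final_answer).

    Falls back to treating the whole text as the answer when the
    marker is absent.
    """
    head, sep, tail = text.partition(marker)
    if not sep:
        return "", text.strip()
    return head.strip(), tail.strip()


# Example with a hypothetical model completion:
reasoning, answer = split_cot_response(
    "First, 12 * 3 = 36. Then 36 + 4 = 40.\nAnswer: 40"
)
# reasoning -> "First, 12 * 3 = 36. Then 36 + 4 = 40."
# answer    -> "40"
```

Pairing this with an instruction such as "think step by step, then give the final result after 'Answer:'" keeps the structured output machine-parseable.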

Licensing

The model is distributed under the MIT License, matching the license of its training data. Users must also comply with the license terms of the base model, Qwen/Qwen3-4B-Instruct-2507 (Apache 2.0).