HidekiKawai/dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 3, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

HidekiKawai/dpo-qwen-cot-merged is a fine-tuned Qwen-based language model, optimized using Direct Preference Optimization (DPO) via Unsloth. This model focuses on enhancing reasoning capabilities through Chain-of-Thought (CoT) and improving structured response quality. It is provided as a full-merged 16-bit model, ready for direct use in applications requiring aligned and coherent text generation.

Loading preview...

Overview

This model, HidekiKawai/dpo-qwen-cot-merged, is a fine-tuned version of HidekiKawai/sft-qwen-merged. It leverages Direct Preference Optimization (DPO) with the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more structured and logical outputs.
  • Improved Response Quality: Fine-tuned to produce higher quality, aligned responses based on a preference dataset.
  • Direct Use: Provided as a full-merged 16-bit model, eliminating the need for adapter loading and simplifying deployment with transformers.

Training Details

  • Base Model: HidekiKawai/sft-qwen-merged
  • Optimization Method: DPO (Direct Preference Optimization)
  • Epochs: 3
  • Learning Rate: 2e-05
  • Max Sequence Length: 1024
  • Training Data: Utilizes the u-10bei/dpo-dataset-qwen-cot dataset for preference alignment.

Usage

This model can be directly loaded and used with the transformers library for inference, as it contains the merged 16-bit weights. Users should ensure compliance with the MIT License and the original base model's license terms.