takayosh/dpo-qwen-cot-merged

Text generation · Concurrency cost: 1 · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The takayosh/dpo-qwen-cot-merged model is a 4-billion-parameter Qwen3-based causal language model, fine-tuned with Direct Preference Optimization (DPO) via Unsloth. It is optimized to strengthen Chain-of-Thought (CoT) reasoning and the quality of structured responses, and is intended for applications that require logical coherence and adherence to preferred output formats.


Overview

This model, takayosh/dpo-qwen-cot-merged, is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base. It has been fine-tuned using Direct Preference Optimization (DPO), leveraging the Unsloth library to align its outputs with preferred responses. The fine-tuning process focused on enhancing the model's reasoning abilities (Chain-of-Thought) and its capacity to generate structured, high-quality responses.
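
For readers curious how a merge like this is typically produced, the sketch below shows a representative Unsloth + TRL DPO fine-tuning loop. Only the base model and dataset names come from this card; the LoRA rank, beta, learning rate, and other hyperparameters are illustrative assumptions, not the author's actual training configuration.

```python
# Hypothetical sketch of a DPO fine-tune like the one described above.
# Hyperparameters (LoRA rank, beta, learning rate) are assumptions;
# only the base model and dataset names come from this card.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,   # matches the training sequence length on this card
    load_in_4bit=True,     # assumption: QLoRA-style training to fit memory
)

# Attach LoRA adapters; DPO trains these, and they are later merged into
# the base weights to produce the full 16-bit model distributed here.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# A DPO preference dataset needs "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,                      # DPO KL-penalty strength (assumed)
        per_device_train_batch_size=2,
        learning_rate=5e-6,
        output_dir="dpo-qwen-cot",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the trained adapters into the base model and save 16-bit weights.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```

The final merge step is what makes the published weights self-contained: downstream users load a single checkpoint rather than a base model plus adapter.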

Key Characteristics

  • Base Model: Qwen/Qwen3-4B-Instruct-2507.
  • Fine-tuning Method: Direct Preference Optimization (DPO) for improved alignment.
  • Optimization Focus: Enhanced reasoning (Chain-of-Thought) and structured output quality.
  • Weights: Fully merged 16-bit weights; no adapter loading is required.
  • Training Data: Utilized the u-10bei/dpo-dataset-qwen-cot dataset.
  • Context Length: Trained with a maximum sequence length of 1,024 tokens; the base model supports a 40,960-token context.

Good For

  • Applications requiring models with improved logical reasoning and step-by-step thought processes.
  • Scenarios where structured and coherent responses are critical.
  • Developers looking for a Qwen3-based model with enhanced alignment to preferred output styles.

Usage

Because the weights are fully merged, the model can be loaded directly with the transformers library; no separate adapter loading step is required.
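
A minimal loading-and-generation example with transformers is shown below. The model ID comes from this card; the prompt and generation settings are illustrative, not recommendations from the model author.

```python
# Minimal sketch: load the merged checkpoint and generate a response.
# Prompt and sampling settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "takayosh/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the card lists BF16 weights
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve step by step: what is 17 * 24?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Using the tokenizer's chat template ensures the prompt matches the instruction format the Qwen3 base model was trained on.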