moushi21/dpo-qwen-cot-merged20

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 22, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The moushi21/dpo-qwen-cot-merged20 is a 4 billion parameter Qwen3-based causal language model, fine-tuned using a four-stage iterative SFT and DPO process. Developed by moushi21, it is specifically optimized for structured data reasoning and Chain-of-Thought (CoT) generation, excelling in tasks requiring complex data format adherence and consistent, high-fidelity outputs. This model is designed for structural evaluation (StructEval-T) with a context length of 32768 tokens.

Loading preview...

Overview

This model, moushi21/dpo-qwen-cot-merged20, is a 4 billion parameter variant of the Qwen3-4B-Instruct-2507 base model. It has been meticulously developed through a four-stage iterative training process combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This unique pipeline aims to achieve precise alignment and deep reasoning capabilities, particularly for structured data tasks.

Key Capabilities

  • Enhanced Complex Reasoning: Specialized in Chain-of-Thought (CoT) processing for structural evaluation.
  • Strict Structural Integrity: Designed to adhere to complex data formats such as JSON and tables.
  • High Consistency: Delivers reliable outputs, even across iterative, multi-turn interactions.
  • Full-Merged Weights: Provides 16-bit weights, eliminating the need for adapter loading.

Training Methodology

The model's training involved an iterative approach:

  1. Stage 1 (SFT): Established foundational knowledge with structured CoT trajectories.
  2. Stage 2 (DPO): Initial alignment to preferred reasoning paths.
  3. Stage 3 (SFT): Reinforced knowledge and refined output formats.
  4. Stage 4 (DPO): Final optimization for high-fidelity structured outputs.

Good For

  • Applications requiring robust structured data reasoning.
  • Tasks that benefit from Chain-of-Thought generation.
  • Scenarios demanding strict adherence to complex output formats (e.g., JSON parsing, table generation).
  • Use cases where consistent and reliable outputs are critical.