HuiyuWang/dpo-qwen-cot-merged
Hosted on Hugging Face · Text generation · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Mar 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

HuiyuWang/dpo-qwen-cot-merged is a 4-billion-parameter Qwen3-based causal language model developed by HuiyuWang, fine-tuned through a multi-stage process of Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The model specializes in structured transformation tasks and Chain-of-Thought (CoT) reasoning, using preference alignment to improve generation quality. It is intended for academic research, competition submissions, and applications that require robust structured data processing.


Model Overview

HuiyuWang/dpo-qwen-cot-merged is a 4-billion-parameter language model built on the Qwen3-4B-Instruct-2507 base. It has undergone a multi-stage fine-tuning process to strengthen its capabilities, particularly in structured data transformation and Chain-of-Thought (CoT) reasoning. The training pipeline began with an initial Supervised Fine-Tuning (SFT) stage, continued with a refinement stage on "hard" structured data, and concluded with Direct Preference Optimization (DPO) to align the model with preferred CoT reasoning patterns.
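The DPO stage trains the model to prefer a "chosen" completion over a "rejected" one relative to a frozen reference model. As a minimal sketch of the per-pair objective (an illustrative stand-alone implementation, not code from this repository), the loss takes the summed log-probabilities of each completion under the policy and the reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy-vs-reference log-ratio of the chosen completion minus that
    of the rejected one. All inputs are summed token log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the margin is 0 and the loss is ln(2);
# the loss shrinks as the policy favors the chosen completion more strongly.
```

The `beta` temperature controls how sharply the policy is pushed away from the reference; its value here is illustrative.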

Key Capabilities

  • Multi-stage Fine-Tuning: Combines SFT, hard data refinement, and DPO for robust performance.
  • Chain-of-Thought (CoT) Reasoning: Specifically aligned to generate step-by-step reasoning, with loss applied only to final outputs during SFT.
  • Structured Transformation: Enhanced for tasks involving the manipulation and transformation of structured data.
  • Preference Alignment: Utilizes DPO with (prompt, chosen, rejected) data to guide model behavior towards desired outputs.
  • Memory-Efficient Training: Fine-tuned using QLoRA and Unsloth for efficient 4-bit training.

Intended Use Cases

This model is particularly well-suited for:

  • Structured transformation tasks: Processing and converting structured data formats.
  • Chain-of-Thought reasoning: Generating detailed, step-by-step solutions to complex problems.
  • Preference-aligned generation: Producing outputs that adhere to specific desired patterns or styles.
  • Academic research experiments: Exploring multi-stage fine-tuning and preference learning techniques.
  • Competition submissions: As a robust foundation for AI challenges requiring reasoning and structured output.
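For the structured-transformation use case, a caller would typically serialize the input record into the prompt and request step-by-step reasoning before the final output. A minimal, hypothetical prompt-building sketch (the helper name and instruction wording are illustrative, not from the model's documentation):

```python
import json

def build_transform_prompt(record, target_format="CSV"):
    """Wrap a structured record in a CoT-style instruction asking the
    model to convert it to `target_format`, reasoning step by step."""
    return (
        f"Transform the following JSON record into {target_format}. "
        "Think step by step, then give the final answer on the last line.\n\n"
        f"```json\n{json.dumps(record, indent=2)}\n```"
    )
```

The resulting string would then be passed to the model through its chat template as the user turn.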