HuiyuWang/dpo-qwen-cot-merged is a 4 billion parameter Qwen3-based causal language model developed by HuiyuWang, fine-tuned through a multi-stage process including Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This model specializes in structured transformation tasks and Chain-of-Thought (CoT) reasoning, leveraging preference alignment for improved generation. It is designed for academic research, competition submissions, and applications requiring robust structured data processing.
Model Overview
HuiyuWang/dpo-qwen-cot-merged is a 4 billion parameter language model built upon the Qwen3-4B-Instruct-2507 base. It has undergone a multi-stage fine-tuning process to enhance its capabilities, particularly in structured data transformation and Chain-of-Thought (CoT) reasoning. The training pipeline involved an initial Supervised Fine-Tuning (SFT) stage, followed by a refinement stage on 'hard' structured-data examples, and finally Direct Preference Optimization (DPO) to align the model with preferred CoT reasoning patterns.
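Assuming the merged weights are published as a standard Hugging Face checkpoint, the model can be loaded with the `transformers` library. This is a minimal sketch; the `torch_dtype` and `device_map` settings are illustrative defaults, not requirements of this checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuiyuWang/dpo-qwen-cot-merged"

def load_model(model_id: str = MODEL_ID):
    """Load the merged checkpoint and its tokenizer.

    Sketch only: assumes standard Hub-hosted weights; dtype and device
    placement below are illustrative and can be tuned to the hardware.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",   # use the precision stored in the checkpoint
        device_map="auto",    # spread layers across available devices
    )
    return model, tokenizer
```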
Key Capabilities
- Multi-stage Fine-Tuning: Combines SFT, hard data refinement, and DPO for robust performance.
- Chain-of-Thought (CoT) Reasoning: Specifically aligned to generate step-by-step reasoning, with loss applied only to final outputs during SFT.
- Structured Transformation: Enhanced for tasks involving the manipulation and transformation of structured data.
- Preference Alignment: Utilizes DPO with (prompt, chosen, rejected) data to guide model behavior towards desired outputs.
- Memory-Efficient Training: Fine-tuned using QLoRA and Unsloth for efficient 4-bit training.
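The DPO stage described above consumes preference triplets. A minimal sketch of that record layout follows; the field names use the common (prompt, chosen, rejected) convention, and the example contents are invented placeholders, not samples from the actual training set.

```python
# One DPO training record: optimization pushes the model toward the
# "chosen" completion and away from the "rejected" one for the same
# prompt. The texts below are invented for illustration only.
dpo_record = {
    "prompt": "Convert the record to JSON: name=Ada, role=engineer",
    "chosen": (
        "Step 1: identify the fields (name, role). "
        "Step 2: emit the JSON object. "
        '{"name": "Ada", "role": "engineer"}'
    ),
    "rejected": '{"name": "Ada"}',  # drops a field, skips the reasoning steps
}

def is_valid_dpo_record(record: dict) -> bool:
    """Check that a record carries the three non-empty string fields DPO needs."""
    return all(
        isinstance(record.get(key), str) and record[key]
        for key in ("prompt", "chosen", "rejected")
    )
```

Validating records up front like this is a cheap guard before handing a dataset to a preference-training loop.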
Intended Use Cases
This model is particularly well-suited for:
- Structured transformation tasks: Processing and converting structured data formats.
- Chain-of-Thought reasoning: Generating detailed, step-by-step solutions to complex problems.
- Preference-aligned generation: Producing outputs that adhere to specific desired patterns or styles.
- Academic research experiments: Exploring multi-stage fine-tuning and preference learning techniques.
- Competition submissions: As a robust foundation for AI challenges requiring reasoning and structured output.
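For the use cases above, inference typically assembles a chat-style prompt and runs it through the tokenizer's chat template. The sketch below assumes standard `transformers` chat-template support in the checkpoint; the system instruction and generation settings are illustrative assumptions, not documented defaults of this model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuiyuWang/dpo-qwen-cot-merged"

# Illustrative chat prompt; the system instruction is an assumption,
# not a documented requirement of this checkpoint.
messages = [
    {"role": "system", "content": "Reason step by step, then give the final answer."},
    {"role": "user", "content": "Transform 'a=1;b=2' into a JSON object."},
]

def generate(messages, max_new_tokens: int = 512) -> str:
    """One chat-style generation pass (sketch; settings are illustrative)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
```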