Overview
This model, moushi21/dpo-qwen-cot-merged20, is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model. It was developed through a four-stage training pipeline that alternates Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The pipeline targets precise alignment and chain-of-thought reasoning, particularly for structured-data tasks.
Key Capabilities
- Enhanced Complex Reasoning: Uses Chain-of-Thought (CoT) generation for evaluating and reasoning over structured data.
- Strict Structural Integrity: Designed to adhere to complex data formats such as JSON and tables.
- High Consistency: Delivers reliable outputs, even across iterative, multi-turn interactions.
- Full-Merged Weights: Ships fully merged 16-bit weights, so no PEFT adapter loading is required.
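Because the weights are fully merged, a checkpoint directory contains the complete model and no adapter artifacts. The sketch below illustrates that distinction; the helper name `is_merged_checkpoint` and the exact adapter file names (which follow common PEFT conventions) are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

# Files a PEFT/LoRA adapter checkpoint would typically contain
# (assumption: standard PEFT naming conventions).
ADAPTER_FILES = {
    "adapter_config.json",
    "adapter_model.safetensors",
    "adapter_model.bin",
}

def is_merged_checkpoint(path: str) -> bool:
    """Return True if the directory looks like a fully merged checkpoint:
    it contains full model weights and no adapter artifacts."""
    names = {p.name for p in Path(path).iterdir()}
    has_weights = any(
        n.endswith(".safetensors") or n.endswith(".bin") for n in names
    )
    return has_weights and not (names & ADAPTER_FILES)
```

A merged repository like this one can therefore be loaded directly with a standard `from_pretrained` call, with no adapter-merging step beforehand.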
Training Methodology
The model's training involved an iterative approach:
- Stage 1 (SFT): Established foundational knowledge with structured CoT trajectories.
- Stage 2 (DPO): Initial alignment to preferred reasoning paths.
- Stage 3 (SFT): Reinforced knowledge and refined output formats.
- Stage 4 (DPO): Final optimization for high-fidelity structured outputs.
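For reference, the DPO stages (2 and 4) optimize the standard DPO objective, which raises the likelihood of the preferred response $y_w$ over the rejected response $y_l$ relative to a frozen reference policy $\pi_{\text{ref}}$ (the specific $\beta$ value and preference-pair construction used for this model are not documented here):

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$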
Good For
- Applications requiring robust structured data reasoning.
- Tasks that benefit from Chain-of-Thought generation.
- Scenarios demanding strict adherence to complex output formats (e.g., JSON parsing, table generation).
- Use cases where consistent and reliable outputs are critical.
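When the model is used for strict JSON output, a common pattern is to separate the CoT portion of a response from the final answer and validate the answer before use. Below is a minimal post-processing sketch; the `<think>...</think>` delimiter is an assumption based on common Qwen-style CoT formatting, so verify it against the model's actual chat template.

```python
import json
import re

def extract_final_json(response: str) -> dict:
    """Strip an optional <think>...</think> CoT block, then parse the
    JSON object in the remaining text."""
    # Remove a CoT block if present (delimiter is an assumption).
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Grab the outermost JSON object in the remaining text.
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```

Example usage:

```python
resp = '<think>The table has two rows.</think>{"rows": 2, "valid": true}'
extract_final_json(resp)  # → {'rows': 2, 'valid': True}
```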