moushi21/dpo-qwen-cot-merged20
moushi21/dpo-qwen-cot-merged20 is a 4-billion-parameter Qwen3-based causal language model, fine-tuned through a four-stage iterative SFT and DPO process. Developed by moushi21, it is optimized for structured data reasoning and Chain-of-Thought (CoT) generation, excelling at tasks that require complex data-format adherence and consistent, high-fidelity outputs. It targets structural evaluation (StructEval-T) and supports a context length of 32768 tokens.
Overview
This model, moushi21/dpo-qwen-cot-merged20, is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model. It was developed through a four-stage iterative training process that alternates Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). This pipeline aims to achieve precise alignment and strong reasoning capability, particularly on structured data tasks.
Key Capabilities
- Enhanced Complex Reasoning: Specialized in Chain-of-Thought (CoT) processing for structural evaluation.
- Strict Structural Integrity: Designed to adhere to complex data formats such as JSON and tables.
- High Consistency: Delivers reliable outputs, even across iterative, multi-turn interactions.
- Full-Merged Weights: Provides 16-bit weights, eliminating the need for adapter loading.
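Because the repository ships full-merged 16-bit weights, the model can be loaded directly with `transformers`, with no PEFT/adapter step. A minimal loading sketch (the dtype and device settings are illustrative assumptions, not prescribed by this card):

```python
MODEL_ID = "moushi21/dpo-qwen-cot-merged20"
MAX_CONTEXT = 32768  # context length stated on this card

def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and the full-merged 16-bit weights (no adapter loading needed)."""
    # Imports kept inside the function so the module itself has no heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 matches the shipped 16-bit checkpoint
        device_map="auto",           # assumption: let transformers place the weights
    )
    return tokenizer, model
```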
Training Methodology
Training followed a four-stage iterative pipeline:
- Stage 1 (SFT): Established foundational knowledge with structured CoT trajectories.
- Stage 2 (DPO): Initial alignment to preferred reasoning paths.
- Stage 3 (SFT): Reinforced knowledge and refined output formats.
- Stage 4 (DPO): Final optimization for high-fidelity structured outputs.
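The DPO stages (2 and 4) optimize the standard Direct Preference Optimization objective over chosen/rejected response pairs. As a minimal numeric illustration (not the author's training code), the per-example loss computed from policy and reference log-probabilities is:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are sequence log-probabilities of the chosen and rejected responses
    under the policy being trained and under the frozen reference model.
    """
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; as the policy widens its preference for the chosen response beyond the reference's, the loss falls toward zero.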
Good For
- Applications requiring robust structured data reasoning.
- Tasks that benefit from Chain-of-Thought generation.
- Scenarios demanding strict adherence to complex output formats (e.g., JSON parsing, table generation).
- Use cases where consistent and reliable outputs are critical.
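For the JSON-adherence use cases above, it is still good practice to validate model replies before consuming them downstream. A small hypothetical helper (the function name and error policy are illustrative, not part of this model's API):

```python
import json

def parse_structured_output(text: str) -> dict:
    """Strictly parse a model reply that is expected to be a JSON object.

    Hypothetical validation helper: raises ValueError if the reply drifts
    from the expected format, so malformed output fails fast.
    """
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("expected a top-level JSON object")
    return obj
```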