Overview
This model, moushi21/dpo-qwen-cot-merged20, is a 4-billion-parameter variant of the Qwen3-4B-Instruct-2507 base model. It was developed through a four-stage training pipeline that alternates Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The pipeline targets precise alignment and chain-of-thought reasoning, particularly for structured-data tasks.
Key Capabilities
- Enhanced Complex Reasoning: Uses Chain-of-Thought (CoT) generation for evaluating and reasoning over structured data.
- Strict Structural Integrity: Designed to adhere to complex data formats such as JSON and tables.
- High Consistency: Delivers reliable outputs, even across iterative, multi-turn interactions.
- Full-Merged Weights: Ships fully merged 16-bit weights, so no PEFT adapter loading is required.
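Because the weights are fully merged, a checkpoint directory contains the complete model and no adapter artifacts. The sketch below illustrates that distinction; the helper name `is_merged_checkpoint` and the exact adapter file names (which follow common PEFT conventions) are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

# Files a PEFT/LoRA adapter checkpoint would typically contain
# (assumption: standard PEFT naming conventions).
ADAPTER_FILES = {
    "adapter_config.json",
    "adapter_model.safetensors",
    "adapter_model.bin",
}

def is_merged_checkpoint(path: str) -> bool:
    """Return True if the directory looks like a fully merged checkpoint:
    it contains full model weights and no adapter artifacts."""
    names = {p.name for p in Path(path).iterdir()}
    has_weights = any(
        n.endswith(".safetensors") or n.endswith(".bin") for n in names
    )
    return has_weights and not (names & ADAPTER_FILES)
```

A merged repository like this one can therefore be loaded directly with a standard `from_pretrained` call, with no adapter-merging step beforehand.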
Training Methodology
The model's training involved an iterative approach:
- Stage 1 (SFT): Established foundational knowledge with structured CoT trajectories.
- Stage 2 (DPO): Initial alignment to preferred reasoning paths.
- Stage 3 (SFT): Reinforced knowledge and refined output formats.
- Stage 4 (DPO): Final optimization for high-fidelity structured outputs.
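For reference, the DPO stages (2 and 4) optimize the standard DPO objective, which raises the likelihood of the preferred response $y_w$ over the rejected response $y_l$ relative to a frozen reference policy $\pi_{\text{ref}}$ (the specific $\beta$ value and preference-pair construction used for this model are not documented here):

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$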
Good For
- Applications requiring robust structured data reasoning.
- Tasks that benefit from Chain-of-Thought generation.
- Scenarios demanding strict adherence to complex output formats (e.g., JSON parsing, table generation).
- Use cases where consistent and reliable outputs are critical.
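When the model is used for strict JSON output, a common pattern is to separate the CoT portion of a response from the final answer and validate the answer before use. Below is a minimal post-processing sketch; the `<think>...</think>` delimiter is an assumption based on common Qwen-style CoT formatting, so verify it against the model's actual chat template.

```python
import json
import re

def extract_final_json(response: str) -> dict:
    """Strip an optional <think>...</think> CoT block, then parse the
    JSON object in the remaining text."""
    # Remove a CoT block if present (delimiter is an assumption).
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Grab the outermost JSON object in the remaining text.
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```

Example usage:

```python
resp = '<think>The table has two rows.</think>{"rows": 2, "valid": true}'
extract_final_json(resp)  # → {'rows': 2, 'valid': True}
```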