Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2
  • Task: Text generation
  • Model Size: 4B
  • Quant: BF16
  • Ctx Length: 32k
  • Published: Feb 7, 2026
  • License: apache-2.0
  • Architecture: Transformer
  • Open weights
  • Concurrency Cost: 1

Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2 is a 4-billion-parameter causal language model based on Qwen3-4B-Instruct, fine-tuned by Rakushaking using Direct Preference Optimization (DPO). The model targets improved Chain-of-Thought reasoning quality and structured-output consistency across general-purpose tasks. It supports a 40960-token context length and is optimized for stronger logical analysis and structural accuracy in its outputs.


Model Overview

This model, Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2, is a 4 billion parameter Qwen3-4B-Instruct variant that has undergone a two-stage fine-tuning process. Initially, it was instruction-tuned (SFT) with structured data and Chain-of-Thought samples. Subsequently, it was further optimized using Direct Preference Optimization (DPO) with a general-purpose preference dataset (u-10bei/dpo-dataset-qwen-cot) to enhance overall reasoning quality and output consistency.
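The DPO stage described above optimizes the policy to prefer chosen over rejected responses relative to the frozen SFT reference model. A minimal, self-contained sketch of the per-pair DPO loss (not the author's training code; the numeric inputs are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and
    rejected responses under the policy and under the frozen
    reference (SFT) model; beta controls how far the policy may
    drift from the reference.
    """
    # Implicit rewards: how much more each response is preferred by
    # the policy than by the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: the loss shrinks as the
    # policy widens the gap between chosen and rejected responses.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair the policy already ranks correctly yields a low loss...
good = dpo_loss(-10.0, -25.0, -12.0, -20.0)
# ...while a mis-ranked pair yields a higher loss.
bad = dpo_loss(-25.0, -10.0, -20.0, -12.0)
```

In actual training this loss is averaged over batches of preference pairs (here, from u-10bei/dpo-dataset-qwen-cot) and backpropagated through the policy only.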

Key Enhancements

  • Improved Chain-of-Thought (CoT) Reasoning: The DPO phase specifically targeted better logical flow and structured analysis in the model's reasoning process.
  • Enhanced Output Structural Accuracy: While parse rates for formats such as JSON and CSV remained at 100%, DPO training improved the internal structure, field mapping, and data-type handling of generated outputs.
  • General-Purpose Optimization: The DPO dataset focused on broad output quality rather than format-specific corrections, leading to a general uplift in response quality.

Performance Insights

During DPO training, the model reached 98.6% accuracy in distinguishing preferred from rejected outputs. Although the overall parse success rate on the public_150 benchmark stayed at 89.3% (unchanged from the SFT stage), the model achieved the highest evaluation score at the time of its training, indicating qualitative improvements beyond raw parsing success. TOML parse rates improved modestly, from 48% to 52%.
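The 98.6% figure corresponds to the standard DPO reward accuracy: the fraction of evaluation pairs whose chosen response receives a higher implicit reward than its rejected counterpart. A minimal sketch (function name and numbers are illustrative, not from the model's training logs):

```python
def preference_accuracy(pairs):
    """Fraction of preference pairs ranked correctly by the policy.

    Each pair holds the implicit rewards beta * (log pi - log pi_ref)
    of the chosen and rejected responses; a pair counts as correct
    when the chosen response's reward is strictly higher.
    """
    correct = sum(1 for chosen_r, rejected_r in pairs
                  if chosen_r > rejected_r)
    return correct / len(pairs)

# Illustrative implicit-reward pairs: three ranked correctly, one not.
eval_pairs = [(0.4, -0.2), (0.1, 0.3), (0.9, 0.0), (0.5, -0.1)]
acc = preference_accuracy(eval_pairs)
```

On this toy set the accuracy is 0.75; the model card's 98.6% was computed the same way over its DPO evaluation split.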

Limitations

This general DPO round did not include structured-data-specific preferences, so it has limited impact on highly format-specific issues (e.g., TOML inline-table vs. section styles). Future DPO rounds are planned to address these formatting challenges.