Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2
  • Task: Text generation
  • Model Size: 4B
  • Quant: BF16
  • Ctx Length: 32k
  • Published: Feb 7, 2026
  • License: apache-2.0
  • Architecture: Transformer
  • Open weights
  • Concurrency Cost: 1

Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2 is a 4-billion-parameter causal language model based on Qwen3-4B-Instruct, fine-tuned by Rakushaking using Direct Preference Optimization (DPO). The model targets improved Chain-of-Thought reasoning quality and structured-output consistency across general-purpose tasks. It supports a 40960-token context length and is optimized for stronger logical analysis and structural accuracy in its outputs.


Model Overview

This model, Rakushaking/Qwen4b-SFT-d9-merged-after-dpo-d2, is a 4 billion parameter Qwen3-4B-Instruct variant that has undergone a two-stage fine-tuning process. Initially, it was instruction-tuned (SFT) with structured data and Chain-of-Thought samples. Subsequently, it was further optimized using Direct Preference Optimization (DPO) with a general-purpose preference dataset (u-10bei/dpo-dataset-qwen-cot) to enhance overall reasoning quality and output consistency.
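The DPO stage described above optimizes the policy to prefer chosen over rejected responses relative to the frozen SFT reference model. A minimal, self-contained sketch of the per-pair DPO loss (not the author's training code; the numeric inputs are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and
    rejected responses under the policy and under the frozen
    reference (SFT) model; beta controls how far the policy may
    drift from the reference.
    """
    # Implicit rewards: how much more each response is preferred by
    # the policy than by the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: the loss shrinks as the
    # policy widens the gap between chosen and rejected responses.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair the policy already ranks correctly yields a low loss...
good = dpo_loss(-10.0, -25.0, -12.0, -20.0)
# ...while a mis-ranked pair yields a higher loss.
bad = dpo_loss(-25.0, -10.0, -20.0, -12.0)
```

In actual training this loss is averaged over batches of preference pairs (here, from u-10bei/dpo-dataset-qwen-cot) and backpropagated through the policy only.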

Key Enhancements

  • Improved Chain-of-Thought (CoT) Reasoning: The DPO phase specifically targeted better logical flow and structured analysis in the model's reasoning process.
  • Enhanced Output Structural Accuracy: While parse rates for formats such as JSON and CSV remained at 100%, DPO training improved the internal structure, field mapping, and data-type handling of generated outputs.
  • General-Purpose Optimization: The DPO dataset focused on broad output quality rather than format-specific corrections, leading to a general uplift in response quality.

Performance Insights

During DPO training, the model reached 98.6% accuracy in distinguishing preferred from rejected outputs. Although the overall parse success rate on the public_150 benchmark stayed at 89.3% (unchanged from the SFT stage), the model achieved the highest evaluation score at the time of its training, indicating qualitative improvements beyond raw parsing success. TOML parse rates improved modestly, from 48% to 52%.
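The 98.6% figure corresponds to the standard DPO reward accuracy: the fraction of evaluation pairs whose chosen response receives a higher implicit reward than its rejected counterpart. A minimal sketch (function name and numbers are illustrative, not from the model's training logs):

```python
def preference_accuracy(pairs):
    """Fraction of preference pairs ranked correctly by the policy.

    Each pair holds the implicit rewards beta * (log pi - log pi_ref)
    of the chosen and rejected responses; a pair counts as correct
    when the chosen response's reward is strictly higher.
    """
    correct = sum(1 for chosen_r, rejected_r in pairs
                  if chosen_r > rejected_r)
    return correct / len(pairs)

# Illustrative implicit-reward pairs: three ranked correctly, one not.
eval_pairs = [(0.4, -0.2), (0.1, 0.3), (0.9, 0.0), (0.5, -0.1)]
acc = preference_accuracy(eval_pairs)
```

On this toy set the accuracy is 0.75; the model card's 98.6% was computed the same way over its DPO evaluation split.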

Limitations

This general DPO round did not include structured-data-specific preferences, so it has limited impact on highly format-specific issues (e.g., TOML inline-table vs. section styles). Future DPO rounds are planned to address these formatting challenges.