Qwen3-4B-SFT: A Reasoning-Focused Warm-Start Model

Qwen3-4B-SFT is a 4 billion parameter model developed by the Sea-Fill Community, fine-tuned from Qwen3-4B-Base using the verl framework. It addresses the need for reproducible 'warm-start' supervised fine-tuning (SFT) bases in open-source practice, bridging the gap between base models and reinforcement learning models.

Key Capabilities & Performance

This model is optimally aligned for Chain-of-Thought (CoT) and instruction following, serving as a robust foundation for Reinforcement Learning. It demonstrates significant performance improvements over its base model on challenging reasoning benchmarks:

AIME 2024: Improved from 11.25% to 20.8% (+9.55%)
AIME 2025: Improved from 6.46% to 19.4% (+12.94%)
AMC 2023: Improved from 31.09% to 58.0% (+26.91%)
GPQA-Diamond: Improved from 7.77% to 29.1% (+21.33%)

These figures highlight its enhanced ability in mathematical and general reasoning tasks. The model was trained with the Qwen chat template and has a context length of 32768 tokens.

Use Cases & Limitations

Qwen3-4B-SFT is particularly well-suited for:

Warm-starting Reinforcement Learning (RL) for alignment research.
Tasks requiring structured reasoning and instruction adherence, especially those benefiting from Chain-of-Thought prompting.

However, users should be aware of its limitations:

Not universally optimized for factual correctness.
May still produce hallucinations or unsafe outputs.
Performance can be sensitive to prompt style and decoding settings.

Overview

Qwen3-4B-SFT: A Reasoning-Focused Warm-Start Model

Key Capabilities & Performance

Use Cases & Limitations

Full Model Card (README)