SeaFill2025/Qwen3-4B-SFT
SeaFill2025/Qwen3-4B-SFT is a 4 billion parameter language model, derived from Qwen3-4B-Base and fine-tuned for enhanced reasoning and instruction following. This model excels as a warm-start for Reinforcement Learning, demonstrating significant improvements in mathematical and general reasoning benchmarks like AIME and GPQA-Diamond. It is specifically optimized for Chain-of-Thought (CoT) processes and instruction adherence, making it suitable for tasks requiring structured reasoning. With a context length of 32768 tokens, it provides a robust base for further alignment research.
Loading preview...
Qwen3-4B-SFT: A Reasoning-Focused Warm-Start Model
Qwen3-4B-SFT is a 4 billion parameter model developed by the Sea-Fill Community, fine-tuned from Qwen3-4B-Base using the verl framework. It addresses the need for reproducible 'warm-start' supervised fine-tuning (SFT) bases in open-source practice, bridging the gap between base models and reinforcement learning models.
Key Capabilities & Performance
This model is optimally aligned for Chain-of-Thought (CoT) and instruction following, serving as a robust foundation for Reinforcement Learning. It demonstrates significant performance improvements over its base model on challenging reasoning benchmarks:
- AIME 2024: Improved from 11.25% to 20.8% (+9.55%)
- AIME 2025: Improved from 6.46% to 19.4% (+12.94%)
- AMC 2023: Improved from 31.09% to 58.0% (+26.91%)
- GPQA-Diamond: Improved from 7.77% to 29.1% (+21.33%)
These figures highlight its enhanced ability in mathematical and general reasoning tasks. The model was trained with the Qwen chat template and has a context length of 32768 tokens.
Use Cases & Limitations
Qwen3-4B-SFT is particularly well-suited for:
- Warm-starting Reinforcement Learning (RL) for alignment research.
- Tasks requiring structured reasoning and instruction adherence, especially those benefiting from Chain-of-Thought prompting.
However, users should be aware of its limitations:
- Not universally optimized for factual correctness.
- May still produce hallucinations or unsafe outputs.
- Performance can be sensitive to prompt style and decoding settings.