SeaFill2025/Qwen3-4B-SFT

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Mar 22, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Cold

SeaFill2025/Qwen3-4B-SFT is a 4 billion parameter language model, derived from Qwen3-4B-Base and fine-tuned for enhanced reasoning and instruction following. This model excels as a warm-start for Reinforcement Learning, demonstrating significant improvements in mathematical and general reasoning benchmarks like AIME and GPQA-Diamond. It is specifically optimized for Chain-of-Thought (CoT) processes and instruction adherence, making it suitable for tasks requiring structured reasoning. With a context length of 32768 tokens, it provides a robust base for further alignment research.

Loading preview...

Qwen3-4B-SFT: A Reasoning-Focused Warm-Start Model

Qwen3-4B-SFT is a 4 billion parameter model developed by the Sea-Fill Community, fine-tuned from Qwen3-4B-Base using the verl framework. It addresses the need for reproducible 'warm-start' supervised fine-tuning (SFT) bases in open-source practice, bridging the gap between base models and reinforcement learning models.

Key Capabilities & Performance

This model is optimally aligned for Chain-of-Thought (CoT) and instruction following, serving as a robust foundation for Reinforcement Learning. It demonstrates significant performance improvements over its base model on challenging reasoning benchmarks:

  • AIME 2024: Improved from 11.25% to 20.8% (+9.55%)
  • AIME 2025: Improved from 6.46% to 19.4% (+12.94%)
  • AMC 2023: Improved from 31.09% to 58.0% (+26.91%)
  • GPQA-Diamond: Improved from 7.77% to 29.1% (+21.33%)

These figures highlight its enhanced ability in mathematical and general reasoning tasks. The model was trained with the Qwen chat template and has a context length of 32768 tokens.

Use Cases & Limitations

Qwen3-4B-SFT is particularly well-suited for:

  • Warm-starting Reinforcement Learning (RL) for alignment research.
  • Tasks requiring structured reasoning and instruction adherence, especially those benefiting from Chain-of-Thought prompting.

However, users should be aware of its limitations:

  • Not universally optimized for factual correctness.
  • May still produce hallucinations or unsafe outputs.
  • Performance can be sensitive to prompt style and decoding settings.