SeaFill2025/Qwen3-4B-SFT
SeaFill2025/Qwen3-4B-SFT-Math is a 4 billion parameter Qwen3-based model developed by the Sea-Fill Community, specifically fine-tuned for long-think mathematical reasoning. Derived from Qwen3-4B-Base, it excels at complex math problems, demonstrating significant improvements on benchmarks like AIME and AMC. This model is optimized for Chain-of-Thought (CoT) and instruction following in mathematical contexts, serving as a robust warm-start for reinforcement learning research.
Qwen3-4B-SFT-Math: Specialized for Mathematical Reasoning
Qwen3-4B-SFT-Math is a 4 billion parameter model from the Sea-Fill Community, fine-tuned from Qwen3-4B-Base using a pure long-think math recipe at a ~45K scale. This model addresses the need for reproducible 'warm-start' SFT bases, bridging the gap between base models and reinforcement learning models, particularly for math-focused applications.
Key Capabilities & Features
- Exceptional Math Reasoning: Demonstrates substantial performance gains in mathematical reasoning, with Pass@1 accuracy improvements of +20.62% on AIME 2025, +19.79% on AIME 2026, and +42.81% on AMC 2023 compared to its base model.
- Optimized for CoT: Aligned for Chain-of-Thought (CoT) and instruction following, making it suitable for complex problem-solving requiring detailed step-by-step reasoning.
- Warm-Start for RL: Designed as a robust SFT-only baseline for reinforcement learning (RL) research, allowing for further alignment studies.
- Qwen Chat Template: Trained with the Qwen chat template, expecting responses to end with <|im_end|>. Users should configure eos_token_id to 151645.
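The chat-template note above can be illustrated with a minimal sketch. It assumes the ChatML-style layout used by Qwen chat templates and builds the prompt by hand, without any library. The helper names `build_chatml_prompt` and `generation_kwargs` are hypothetical, and the model's own tokenizer template remains the authoritative source for the exact format. The 32768 default follows the 32K budget suggested for AIME-level evaluations and assumes "32K" means 32768 tokens.

```python
from typing import Optional

# Token id for <|im_end|>, as stated in the model card.
IM_END_TOKEN_ID = 151645


def build_chatml_prompt(question: str, system: Optional[str] = None) -> str:
    """Format a single-turn prompt in ChatML style (hypothetical helper)."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{question}<|im_end|>\n")
    # Leave the assistant turn open so the model writes the answer
    # and closes it with <|im_end|>.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def generation_kwargs(max_new_tokens: int = 32768) -> dict:
    """Collect generation settings; the key names assume a Hugging Face
    style generate() interface."""
    return {
        "eos_token_id": IM_END_TOKEN_ID,   # stop when <|im_end|> is emitted
        "max_new_tokens": max_new_tokens,  # assumed budget, see lead-in
    }


prompt = build_chatml_prompt("Find all real x with x^2 - 5x + 6 = 0.")
settings = generation_kwargs()
```

In practice the prompt would usually come from the tokenizer's built-in chat template rather than from hand-built strings; the sketch only makes the expected turn structure and stop token explicit.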
Use Cases & Limitations
- Good for: Pure mathematical reasoning tasks, especially those requiring long-think processes and detailed derivations. Ideal for researchers exploring SFT-to-RL alignment in math domains.
- Limitations: This model is math-only SFT and is not optimized for general-domain reasoning, factuality, or instruction following outside of mathematics. It may produce hallucinations or unsafe outputs in non-math contexts. Long rollouts are common, and a significant fraction hit the 16K token cap on hard problems, so a larger token budget (e.g., 32K) is advisable for AIME-level evaluations.
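To decide whether a larger token budget is needed for a given evaluation, one option is to measure how often rollouts reach the cap. The sketch below is illustrative only: `truncation_rate` is a hypothetical helper, the cap of 16384 assumes "16K" means 16 × 1024 tokens, and the sample lengths are made-up values, not results from this model.

```python
def truncation_rate(token_counts, cap=16384):
    """Return the fraction of rollouts whose generated length reached the cap.

    token_counts: generated-token counts, one per rollout.
    cap: the max-new-tokens limit used during generation (assumed 16384).
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    truncated = sum(1 for n in token_counts if n >= cap)
    return truncated / len(token_counts)


# Made-up lengths for illustration only.
lengths = [4200, 16384, 9100, 16384, 12050]
rate = truncation_rate(lengths)
print(f"{rate:.0%} of rollouts hit the cap")
```

If the measured rate is high on hard problems, rerunning with a larger budget helps separate genuinely wrong answers from answers that were cut off before completion.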