Model Overview
simplescaling/s1-32B is a 32-billion-parameter language model developed by simplescaling for reasoning tasks. It is fine-tuned from Qwen2.5-32B-Instruct and stands out for its training efficiency, reaching competitive performance with only 1,000 supervised fine-tuning examples.
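Below is a minimal loading-and-generation sketch using Hugging Face Transformers. It assumes the checkpoint is published on the Hub as simplescaling/s1-32B with a chat template bundled in the tokenizer, and that sufficient GPU memory is available (roughly 64 GB and up in bfloat16); the prompt and generation length are illustrative.

```python
# Minimal sketch: load s1-32B and run a single reasoning query.
# Assumes the Hub checkpoint "simplescaling/s1-32B" and a bundled chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "simplescaling/s1-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # shard across available GPUs
)

messages = [{"role": "user", "content": "What is the sum of the first 50 odd numbers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long chain of thought, so leave generous headroom.
output_ids = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```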
Key Capabilities and Features
- Reasoning Focus: The model is specifically designed and optimized for complex reasoning, as evidenced by its performance on mathematical and general problem-solving benchmarks.
- Efficient Training: Achieves strong results with a remarkably small training dataset of just 1,000 examples.
- Test-Time Scaling: Incorporates "budget forcing" during evaluation, a decoding-time technique that extends the model's reasoning, trading additional test-time compute for better answers.
- Competitive Performance: s1-32B matches or exceeds models such as o1-preview on benchmarks like AIME2024 and MATH500, particularly when budget forcing is applied.
Evaluation Highlights
Evaluations show s1-32B's strong performance in reasoning:
- AIME2024: 56.7%
- MATH500: 93.0%
- GPQA-Diamond: 59.6%
Note that these results were obtained with budget forcing: during decoding, the end-of-thinking delimiter is ignored and "Wait" is appended up to four times, pushing the model to extend and often self-correct its reasoning (sketched below). For potentially better performance, users are recommended to consider its successor, s1.1-32B.
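The following is an illustrative sketch of that loop, not the authors' exact evaluation code. Both the generate_until helper and the END_OF_THINKING delimiter are hypothetical stand-ins: the helper represents a single decoding call that halts before a stop string, and the delimiter marks the model's switch from reasoning to answering (the concrete token depends on the chat template).

```python
# Illustrative budget-forcing loop (assumptions noted above; adapt to your stack).

END_OF_THINKING = "<|end_of_thinking|>"  # assumed delimiter; template-dependent

def generate_until(context: str, stop: str | None) -> str:
    """Hypothetical helper: decode from `context`, halting before `stop` appears."""
    raise NotImplementedError  # wire this to your inference engine (e.g. vLLM)

def budget_forced_generate(prompt: str, max_waits: int = 4) -> str:
    # Let the model think until it first tries to close its reasoning phase.
    thinking = generate_until(prompt, stop=END_OF_THINKING)
    for _ in range(max_waits):
        # Suppress the end-of-thinking delimiter and append "Wait", nudging
        # the model to keep reasoning and, frequently, to self-correct.
        thinking += "Wait"
        thinking += generate_until(prompt + thinking, stop=END_OF_THINKING)
    # Finally let thinking end and decode the answer unconstrained.
    answer = generate_until(prompt + thinking + END_OF_THINKING, stop=None)
    return thinking + END_OF_THINKING + answer
```

The design point is that the extra test-time compute is spent only on the reasoning phase; once the thinking budget is exhausted, the answer is decoded normally.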
Use Cases
This model is particularly well-suited for applications requiring advanced reasoning and problem-solving, especially in domains like mathematics and complex question answering, where its budget forcing mechanism can be leveraged for improved accuracy.