simplescaling/s1-32B
Available on Hugging Face

Text generation · 32B parameters · FP8 quantization · 32k context length · Published: Jan 14, 2025 · License: apache-2.0 · Architecture: Transformer · Open weights

The simplescaling/s1-32B is a 32 billion parameter reasoning model, fine-tuned from Qwen2.5-32B-Instruct by simplescaling. It is notable for achieving strong reasoning performance, matching o1-preview, despite being trained on only 1,000 examples. This model demonstrates test-time scaling through a technique called budget forcing, making it suitable for complex problem-solving tasks.


Model Overview

simplescaling/s1-32B is a 32 billion parameter language model developed by simplescaling, specifically fine-tuned for reasoning tasks. It is based on the Qwen2.5-32B-Instruct architecture and stands out for its efficient training, utilizing only 1,000 examples to achieve competitive performance.

Key Capabilities and Features

  • Reasoning Focus: The model is specifically designed and optimized for complex reasoning, as evidenced by its performance on mathematical and general problem-solving benchmarks.
  • Efficient Training: Achieves strong results with a remarkably small training dataset of just 1,000 examples.
  • Test-Time Scaling: Incorporates "budget forcing" at inference time, a technique that improves reasoning performance by forcing the model to extend its chain of thought before answering.
  • Competitive Performance: Benchmarks indicate that s1-32B matches or exceeds the performance of models like o1-preview on metrics such as AIME2024 and MATH500, particularly when budget forcing is applied.

Evaluation Highlights

Evaluations show s1-32B's strong performance in reasoning:

  • AIME2024: 56.7
  • MATH500: 93.0
  • GPQA-Diamond: 59.6

Note that these benchmark results for s1-32B use budget forcing, which suppresses the model's end-of-thinking delimiter and appends "Wait" to the reasoning trace up to four times, extending the chain of thought before the final answer. For potentially better performance, users are recommended to consider its successor, s1.1-32B.
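The budget-forcing loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `generate` is a hypothetical stand-in for a real model call, and `end_think` stands in for the model's actual end-of-thinking delimiter.

```python
# Minimal sketch of budget forcing. `generate` is a hypothetical stand-in
# for a model call that returns a text completion; `end_think` stands in
# for the model's end-of-thinking delimiter (the real token differs).

def budget_force(generate, prompt, end_think, max_waits=4):
    """Suppress the end-of-thinking delimiter up to `max_waits` times,
    appending "Wait" so the model keeps reasoning before it answers."""
    text = prompt
    for _ in range(max_waits):
        out = generate(text)
        if end_think not in out:
            # Model kept reasoning on its own; no forcing needed.
            return text + out
        # Drop the delimiter, keep the reasoning, and force a continuation.
        text += out.split(end_think, 1)[0] + "Wait"
    # Thinking budget exhausted: let the model finish its answer.
    return text + generate(text)

# Toy model that always tries to stop thinking immediately.
def fake_generate(text):
    return " ...a reasoning step...</think>"

out = budget_force(fake_generate, "Q: 2 + 2?", end_think="</think>", max_waits=2)
# "Wait" is appended exactly twice before the final completion is allowed.
```

In practice this is done inside the decoding loop (e.g., via stop-token suppression) rather than by string post-processing, but the control flow is the same: each suppressed stop buys the model another round of reasoning.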

Use Cases

This model is particularly well-suited for applications requiring advanced reasoning and problem-solving, especially in domains like mathematics and complex question answering, where its budget forcing mechanism can be leveraged for improved accuracy.