Overview
MetaStone-S1-32B is a 32.8-billion-parameter reflective generative model developed by MetaStoneTec. It introduces a reflective generative form that unifies "Long-CoT Reinforcement Learning" and "Process Reward Learning," a training methodology that lets the model develop deep reasoning while simultaneously learning to select high-quality reasoning trajectories. Because the policy model and the Process Reward Model (PRM) share a single backbone network, MetaStone-S1-32B cuts PRM inference cost by 99%, leading to faster and higher-quality responses.
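The shared-backbone idea can be sketched in a toy form: one forward pass over the token sequence produces hidden states that feed both the policy head (next-token logits) and a lightweight PRM head (per-step reward scores), so no separate PRM forward pass is needed. All names, dimensions, and the score aggregation below are illustrative assumptions, not the actual MetaStone-S1 architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real model is a 32.8B transformer.
d_model, vocab = 16, 50

def backbone(token_ids):
    # Stand-in for the transformer stack: one hidden vector per token.
    # In the real model this is the expensive part that is run only once.
    return rng.standard_normal((len(token_ids), d_model))

W_policy = rng.standard_normal((d_model, vocab))  # full LM head (policy)
w_prm = rng.standard_normal(d_model)              # lightweight PRM head

tokens = [3, 14, 15, 9, 2, 6]
h = backbone(tokens)                  # single shared forward pass

logits = h @ W_policy                 # policy: next-token logits per position
step_scores = h @ w_prm               # PRM: per-step scores from the SAME states

# Trajectory score: mean sigmoid of step scores (aggregation is an assumption).
traj_score = float(np.mean(1.0 / (1.0 + np.exp(-step_scores))))
print(logits.shape, step_scores.shape)
```

Because the PRM head is only a small projection on top of hidden states the policy already computed, scoring a trajectory adds almost nothing to the cost of generating it, which is where the claimed 99% PRM-inference saving comes from.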
Key Capabilities
- Advanced Reasoning: Excels in complex mathematics, coding, and Chinese reasoning tasks.
- Efficient Inference: Cuts PRM inference cost by 99% thanks to the shared backbone architecture.
- Competitive Performance: Demonstrates performance comparable to larger models, including the OpenAI-o3 series, despite its 32.8B parameter size.
- Long Context: Supports a context length of 131,072 (128K) tokens.
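The trajectory-selection capability amounts to best-of-N test-time scaling: sample several candidate reasoning trajectories from the policy, score each with the PRM, and keep the best. A minimal sketch, where `fake_step_score` is a hypothetical stand-in for the real PRM head:

```python
import random

random.seed(42)

def fake_step_score(step):
    # Stand-in for the PRM head; real scores come from the shared backbone.
    return random.random()

def score_trajectory(steps):
    # Average per-step score as the trajectory score (an assumption).
    return sum(fake_step_score(s) for s in steps) / len(steps)

def best_of_n(candidates):
    # Best-of-N selection: keep the highest-scoring trajectory.
    return max(candidates, key=score_trajectory)

candidates = [
    ["set up equation", "solve for x", "check answer"],
    ["guess", "verify"],
    ["draw diagram", "apply theorem", "compute", "check"],
]
best = best_of_n(candidates)
```

Variants such as the 'high' setting reported below plausibly correspond to spending more test-time compute (larger N) on this selection loop, though the exact configuration is not specified here.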
Performance Highlights
MetaStone-S1-32B (specifically the 'high' variant) shows strong benchmark results:
- AIME24: 85.2 (outperforming DeepSeek-R1-671B and OpenAI-o3-mini-medium)
- AIME25: 73.6 (competitive with OpenAI-o3-mini-medium)
- C-EVAL: 89.7 (competitive with DeepSeek-R1-671B)
Good for
- Applications requiring strong mathematical and coding reasoning.
- Tasks demanding high-quality, explainable reasoning trajectories.
- Scenarios where efficient inference for complex reasoning is critical.
- Use cases benefiting from a long context window for detailed problem-solving.