OREAL-32B: Advanced Mathematical Reasoning Model
OREAL-32B is a 32-billion-parameter model from InternLM, designed specifically for advanced mathematical reasoning. It is trained with a novel reinforcement learning framework called Outcome Reward-based Reinforcement Learning (OREAL), which targets tasks where only a binary outcome reward — the final answer is either correct or incorrect — is available.
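In this setting, the only training signal is whether the final answer checks out; there is no partial credit for intermediate steps. A minimal sketch of such a binary outcome reward, assuming a simple string-match verifier (real math verifiers normalize LaTeX, fractions, and equivalent forms before comparing, and this helper name is illustrative, not from the OREAL codebase):

```python
def outcome_reward(model_answer: str, reference_answer: str) -> int:
    """Binary outcome reward: 1 if the model's final answer matches the
    reference, 0 otherwise. Deliberately simplistic for illustration."""
    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "").lower()
    return int(normalize(model_answer) == normalize(reference_answer))

# Only the final answer matters, not the reasoning trace that produced it.
print(outcome_reward("42", "42"))   # 1
print(outcome_reward("6/2", "3"))   # 0 (a real verifier would evaluate equivalence)
```

Because the reward is this sparse, a single bit per long chain-of-thought trajectory, standard policy-gradient training struggles, which motivates the techniques below.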
Key Capabilities & Innovations
- Superior Mathematical Performance: OREAL-32B achieves 95.0% pass@1 accuracy on MATH-500, outperforming other 32B models on this benchmark.
- Novel RL Framework: The OREAL method uses best-of-N (BoN) sampling to select positive trajectories for behavior cloning, and reshapes the rewards of negative samples to keep gradients consistent between positive and negative examples.
- Sparse Reward Handling: It addresses sparse rewards in long chain-of-thought reasoning by using an on-policy token-level reward model to identify key tokens for importance sampling.
- Comprehensive Evaluation: The model demonstrates strong performance across various mathematical benchmarks including AIME2024, AIME2025-I, LiveMath, and Olympiad.
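The BoN-plus-reward-reshaping idea above can be sketched compactly. This is a toy illustration under stated assumptions: the sampler and verifier are dummies, and the reshaping shown here (scaling the negative reward by the batch success rate `p`) is a simplification; the paper derives the exact reshaping from the BoN sampling distribution.

```python
import random

def best_of_n(sample_fn, verify_fn, n=8):
    """Best-of-N sampling: draw n candidate solutions and split them by
    their binary outcome reward. Positives are kept for behavior cloning."""
    samples = [sample_fn() for _ in range(n)]
    positives = [s for s in samples if verify_fn(s)]
    negatives = [s for s in samples if not verify_fn(s)]
    return positives, negatives

def reshape_negative_reward(p: float) -> float:
    """Toy reshaping of the reward for negative samples (assumption: the
    actual OREAL reshaping differs). Scaling by the success rate p keeps
    the gradient contribution of negatives on a scale comparable to the
    positives cloned at reward 1."""
    return -p / max(1.0 - p, 1e-6)

# Usage with a dummy sampler that "solves" the problem about half the time.
random.seed(0)
pos, neg = best_of_n(lambda: random.choice(["3", "7"]), lambda s: s == "3", n=8)
p = len(pos) / (len(pos) + len(neg))
```

The split into positives (imitated directly) and negatives (down-weighted, reshaped rewards) is what lets a purely binary signal still produce a stable gradient.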
When to Use OREAL-32B
- Complex Mathematical Problem Solving: Ideal for applications requiring high accuracy in mathematical reasoning and problem-solving.
- Research in RL for Reasoning: Useful for researchers exploring advanced reinforcement learning techniques for complex cognitive tasks.
- Educational Tools: Can be integrated into systems that require rigorous, step-by-step mathematical explanations and solutions.
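When evaluating whether the model's accuracy meets an application's bar, the standard metric is pass@k (pass@1 in the headline number above). A small sketch of the widely used unbiased pass@k estimator, computed from `n` sampled solutions per problem of which `c` are correct (the numbers below are hypothetical, not benchmark results):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generated solutions (c of them correct) passes.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-draw; some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the fraction correct.
print(pass_at_k(4, 2, 1))   # 0.5
print(pass_at_k(10, 0, 1))  # 0.0
```

Averaging this estimator over a problem set gives the benchmark-level pass@1 figure.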
InternLM has also released OREAL-7B and corresponding SFT models, along with the RL training prompts, to support further community research in mathematical reasoning.