OREAL-32B: Advanced Mathematical Reasoning Model
OREAL-32B is a 32-billion-parameter model from InternLM, designed specifically for advanced mathematical reasoning. It is trained with a novel reinforcement learning framework called Outcome Reward-based Reinforcement Learning (OREAL), which targets tasks where only a binary outcome reward — the final answer is either correct or incorrect — is available.
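In this setting, the only training signal is whether the final answer checks out; there is no partial credit for intermediate steps. A minimal sketch of such a binary outcome reward, assuming a simple string-match verifier (real math verifiers normalize LaTeX, fractions, and equivalent forms before comparing, and this helper name is illustrative, not from the OREAL codebase):

```python
def outcome_reward(model_answer: str, reference_answer: str) -> int:
    """Binary outcome reward: 1 if the model's final answer matches the
    reference, 0 otherwise. Deliberately simplistic for illustration."""
    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "").lower()
    return int(normalize(model_answer) == normalize(reference_answer))

# Only the final answer matters, not the reasoning trace that produced it.
print(outcome_reward("42", "42"))   # 1
print(outcome_reward("6/2", "3"))   # 0 (a real verifier would evaluate equivalence)
```

Because the reward is this sparse, a single bit per long chain-of-thought trajectory, standard policy-gradient training struggles, which motivates the techniques below.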
Key Capabilities & Innovations
- Superior Mathematical Performance: OREAL-32B achieves 95.0% pass@1 accuracy on MATH-500, outperforming other 32B models on this benchmark.
- Novel RL Framework: The OREAL method uses best-of-N (BoN) sampling to select positive trajectories for behavior cloning, and reshapes the rewards of negative samples to keep gradients consistent between positive and negative examples.
- Sparse Reward Handling: It addresses sparse rewards in long chain-of-thought reasoning by using an on-policy token-level reward model to identify key tokens for importance sampling.
- Comprehensive Evaluation: The model demonstrates strong performance across various mathematical benchmarks including AIME2024, AIME2025-I, LiveMath, and Olympiad.
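The BoN-plus-reward-reshaping idea above can be sketched compactly. This is a toy illustration under stated assumptions: the sampler and verifier are dummies, and the reshaping shown here (scaling the negative reward by the batch success rate `p`) is a simplification; the paper derives the exact reshaping from the BoN sampling distribution.

```python
import random

def best_of_n(sample_fn, verify_fn, n=8):
    """Best-of-N sampling: draw n candidate solutions and split them by
    their binary outcome reward. Positives are kept for behavior cloning."""
    samples = [sample_fn() for _ in range(n)]
    positives = [s for s in samples if verify_fn(s)]
    negatives = [s for s in samples if not verify_fn(s)]
    return positives, negatives

def reshape_negative_reward(p: float) -> float:
    """Toy reshaping of the reward for negative samples (assumption: the
    actual OREAL reshaping differs). Scaling by the success rate p keeps
    the gradient contribution of negatives on a scale comparable to the
    positives cloned at reward 1."""
    return -p / max(1.0 - p, 1e-6)

# Usage with a dummy sampler that "solves" the problem about half the time.
random.seed(0)
pos, neg = best_of_n(lambda: random.choice(["3", "7"]), lambda s: s == "3", n=8)
p = len(pos) / (len(pos) + len(neg))
```

The split into positives (imitated directly) and negatives (down-weighted, reshaped rewards) is what lets a purely binary signal still produce a stable gradient.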
When to Use OREAL-32B
- Complex Mathematical Problem Solving: Ideal for applications requiring high accuracy in mathematical reasoning and problem-solving.
- Research in RL for Reasoning: Useful for researchers exploring advanced reinforcement learning techniques for complex cognitive tasks.
- Educational Tools: Can be integrated into systems that require rigorous, step-by-step mathematical explanations and solutions.
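When evaluating whether the model's accuracy meets an application's bar, the standard metric is pass@k (pass@1 in the headline number above). A small sketch of the widely used unbiased pass@k estimator, computed from `n` sampled solutions per problem of which `c` are correct (the numbers below are hypothetical, not benchmark results):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generated solutions (c of them correct) passes.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-draw; some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the fraction correct.
print(pass_at_k(4, 2, 1))   # 0.5
print(pass_at_k(10, 0, 1))  # 0.0
```

Averaging this estimator over a problem set gives the benchmark-level pass@1 figure.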
InternLM has also released OREAL-7B and corresponding SFT models, along with the RL training prompts, to support further community research in mathematical reasoning.