internlm/OREAL-7B-SFT

7.6B params · FP8 · 131,072-token context · License: apache-2.0
Overview

OREAL-7B-SFT: Mathematical Reasoning Foundation

OREAL-7B-SFT is a 7.6 billion parameter supervised fine-tuned (SFT) model from InternLM. It serves as the initial policy for the OREAL (Outcome REwArd-based reinforcement Learning) framework, an RL approach designed for tasks that provide only binary outcome rewards, with mathematical reasoning as its primary target. This release is the SFT base; the RL-trained OREAL models built on top of it deliver the framework's headline results (see below).
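
A minimal usage sketch follows, assuming the model loads through the standard transformers AutoModel interfaces; the dtype, device_map, and trust_remote_code settings here are assumptions rather than values taken from the model card, so check the repository's own instructions:

```python
# Minimal sketch (not from the model card): loading OREAL-7B-SFT with
# the Hugging Face transformers library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/OREAL-7B-SFT"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed dtype; adjust to your hardware
    device_map="auto",
    trust_remote_code=True,       # assumption; may not be required
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```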

Key Capabilities & Performance

OREAL-7B-SFT is part of a series that achieves strong results on mathematical reasoning benchmarks. The RL-trained OREAL-7B, for instance, reaches 94.0 pass@1 accuracy on MATH-500, matching the performance of previous 32B models, and this SFT version provides the foundation for those capabilities. To cope with the sparse rewards of long chain-of-thought reasoning, the OREAL framework combines best-of-N sampling for behavior cloning with an on-policy token-level reward model for finer-grained credit assignment.
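
The following is an illustrative sketch of the best-of-N selection step described above: sample several candidate solutions, keep those that earn a positive binary outcome reward, and pick one as a behavior-cloning target. The helper names (sample_response, check_answer) are hypothetical placeholders, not functions from the OREAL codebase:

```python
# Illustrative best-of-N (BoN) selection for behavior cloning under a
# binary outcome reward. This is a sketch of the general idea, not the
# OREAL implementation.
from typing import Callable, List, Optional

def best_of_n(
    problem: str,
    sample_response: Callable[[str], str],      # policy sampler (placeholder)
    check_answer: Callable[[str, str], bool],   # binary outcome reward / verifier (placeholder)
    reference_answer: str,
    n: int = 16,
) -> Optional[str]:
    """Sample n candidate solutions and return one that earns a positive
    outcome reward, to be used as a behavior-cloning target."""
    positives: List[str] = []
    for _ in range(n):
        candidate = sample_response(problem)
        if check_answer(candidate, reference_answer):
            positives.append(candidate)
    # With a binary reward, any correct sample is a valid target; a simple
    # length heuristic breaks ties here (an assumption, not the paper's rule).
    return min(positives, key=len) if positives else None
```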

Intended Use Cases

  • Mathematical Reasoning: Ideal for tasks requiring systematic thinking, rigorous proof, and multi-angle analysis in mathematics.
  • Foundation for RL Training: Serves as a robust base model for further reinforcement learning in complex reasoning domains.
  • Research in Mathematical AI: Useful for researchers exploring advanced techniques in AI for mathematics, particularly those interested in outcome reward-based RL.

Users should note that the OREAL models utilize a specific system prompt to guide the model's reasoning process, emphasizing deep understanding, multi-angle analysis, systematic thinking, rigorous proof, and repeated verification.
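
As a rough illustration, the snippet below shows how such a system prompt could be supplied through the tokenizer's chat template. The prompt text is a paraphrase of the themes listed above, not the exact system prompt shipped with the model; use the one provided in the repository:

```python
# Sketch of supplying a system prompt via the chat template. The system
# message below is a placeholder paraphrase, NOT the official OREAL prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "internlm/OREAL-7B-SFT", trust_remote_code=True
)

messages = [
    {"role": "system", "content": (
        "You are a careful mathematical reasoner. Seek deep understanding, "
        "analyze from multiple angles, think systematically, give rigorous "
        "proofs, and verify your work repeatedly."  # placeholder wording
    )},
    {"role": "user", "content": "Find all real x with x^2 - 5x + 6 = 0."},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```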