Overview
The internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B is a 7.6-billion-parameter language model in the OREAL series from InternLM. It is designed for advanced mathematical reasoning and is trained with Outcome REwArd-based reinforcement Learning (OREAL), a reinforcement learning framework tailored to settings where only binary outcome rewards are available. To cope with the resulting reward sparsity in long chain-of-thought reasoning, OREAL incorporates an on-policy token-level reward model for credit assignment.
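To make "binary outcome reward" concrete, here is a minimal sketch of how such a reward could be computed for a math problem: the verifier only checks whether the final answer matches, yielding 1 or 0 with no partial credit. The helper names (`extract_boxed`, `outcome_reward`) are illustrative, not part of the OREAL codebase, and the simple regex ignores nested braces.

```python
import re

def extract_boxed(text: str):
    """Pull the contents of the last \\boxed{...} in a model response.

    Illustrative helper; real verifiers normalize answers more carefully.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def outcome_reward(response: str, ground_truth: str) -> int:
    """Binary outcome reward: 1 if the final boxed answer matches, else 0."""
    answer = extract_boxed(response)
    return int(answer is not None and answer == ground_truth.strip())
```

Because the reward says nothing about *which* steps in a long solution were right or wrong, a token-level reward model is needed to distribute credit across the trajectory.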
Key Capabilities & Performance
- Exceptional Mathematical Reasoning: Achieves 94.0 pass@1 accuracy on MATH-500, demonstrating performance comparable to previous 32B models in mathematical problem-solving.
- Reinforcement Learning Optimization: Trains with best-of-N (BoN) sampling of reasoning trajectories and reshaped rewards for negative samples, keeping gradients consistent between successful and failed rollouts.
- Sparse Reward Handling: Employs an on-policy token-level reward model to identify key tokens in reasoning trajectories, which is crucial for credit assignment in long, complex solutions.
- Competitive Benchmarks: Outperforms many 7B and some 32B models on various mathematical benchmarks, including AIME2024, AIME2025-I, LiveMath, and Olympiad.
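The BoN sampling and negative-reward reshaping mentioned above can be sketched as follows. This is a simplified stand-in, not OREAL's actual shaping function: `best_of_n` keeps the first sampled response that earns a positive binary reward, and `reshape_negative_reward` scales the penalty on failed samples by an estimated success rate so that gradient magnitudes from positive and negative samples stay roughly balanced. All function names are hypothetical.

```python
def best_of_n(sample_fn, reward_fn, n: int = 8):
    """Draw up to n candidate responses; return (response, reward) for the
    first candidate with reward 1, or the last candidate if all fail."""
    best = None
    for _ in range(n):
        resp = sample_fn()
        best = (resp, reward_fn(resp))
        if best[1] == 1:
            break
    return best

def reshape_negative_reward(reward: int, success_rate: float) -> float:
    """Illustrative reshaping (not the paper's exact formula): leave positive
    rewards at 1.0 and weight failures by success_rate / (1 - success_rate),
    so easy problems (high success rate) penalize failures more heavily."""
    if reward == 1:
        return 1.0
    return -success_rate / max(1.0 - success_rate, 1e-6)
```

In OREAL the shaping is derived so that training on BoN-selected positives and reshaped negatives remains consistent with optimizing the underlying binary objective; the sketch above only conveys the shape of that idea.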
Use Cases
- Competitive Mathematics: Ideal for solving competition-style problems, such as those from AIME and Olympiad-level contests.
- Advanced Problem Solving: Suitable for applications requiring rigorous logical deduction and multi-step mathematical reasoning.
- Research in RL for Reasoning: Provides a strong baseline and methodology for further research into reinforcement learning for complex reasoning tasks.