internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B

  • Task: Text Generation
  • Concurrency Cost: 1
  • Model Size: 7.6B
  • Quantization: FP8
  • Context Length: 32k
  • Published: Feb 10, 2025
  • License: apache-2.0
  • Architecture: Transformer

The internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B is a 7.6 billion parameter mathematical reasoning model developed by InternLM, fine-tuned using Outcome Reward-based Reinforcement Learning (OREAL). This model excels in complex mathematical problem-solving, achieving 94.0 pass@1 accuracy on MATH-500, matching larger 32B models. It is specifically optimized for tasks where only binary outcome rewards are available, making it highly effective for competitive mathematics and rigorous logical reasoning.


Overview

The internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B is a 7.6 billion parameter language model from the OREAL series, developed by InternLM. It is specifically designed for advanced mathematical reasoning tasks, leveraging a novel reinforcement learning framework called Outcome REwArd-based reinforcement Learning (OREAL). This framework is tailored for scenarios where only binary outcome rewards are available, addressing the challenge of sparse rewards in long chain-of-thought reasoning by incorporating an on-policy token-level reward model.
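The idea of a token-level reward model is to credit the few tokens that actually decide a reasoning trajectory's outcome, rather than spreading a single binary reward uniformly over a long chain of thought. A minimal sketch of that weighting step is below; the function names and the normalization scheme are illustrative assumptions, not the exact formulation used by OREAL.

```python
# Illustrative sketch: combining per-token losses with importance weights
# produced by a token-level reward model. The weighting/normalization here
# is an assumption for demonstration, not OREAL's exact objective.

def weighted_sequence_loss(token_losses, token_weights):
    """Average per-token losses, weighted by estimated token importance.

    token_losses  -- per-token training losses for one reasoning trajectory
    token_weights -- non-negative importance scores (e.g. from a token-level
                     reward model); zero weight means the token is ignored
    """
    total = sum(token_weights)
    if total == 0:
        # No signal from the reward model: fall back to a uniform average.
        return sum(token_losses) / len(token_losses)
    return sum(l * (w / total) for l, w in zip(token_losses, token_weights))
```

With uniform weights this reduces to the ordinary mean loss; concentrating weight on "key" tokens focuses the gradient on the steps that determined the outcome.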

Key Capabilities & Performance

  • Exceptional Mathematical Reasoning: Achieves 94.0 pass@1 accuracy on MATH-500, demonstrating performance comparable to previous 32B models in mathematical problem-solving.
  • Reinforcement Learning Optimization: Utilizes a unique RL approach with best-of-N (BoN) sampling and reshaped negative sample rewards for gradient consistency.
  • Sparse Reward Handling: Employs an on-policy token-level reward model to identify key tokens in reasoning trajectories, crucial for complex mathematical proofs.
  • Competitive Benchmarks: Outperforms many 7B and some 32B models on various mathematical benchmarks, including AIME2024, AIME2025-I, LiveMath, and Olympiad.
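The best-of-N sampling and negative-reward reshaping mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: `sample_fn` and `verify_fn` are hypothetical stand-ins for a model's sampler and a binary outcome verifier, and the `-p/(1-p)` reshaping is one simple way to balance positive and negative gradient contributions, not necessarily the paper's exact formula.

```python
import random

def best_of_n(sample_fn, verify_fn, n=8, seed=0):
    """Draw n candidate solutions; keep a verified one if any exists.

    Returns (best_sample, reshaped_rewards). Positive samples keep reward
    +1; negative samples get -p/(1-p), where p is the empirical success
    rate, so the expected reward over the batch is zero (an assumed
    reshaping for gradient consistency, illustrative only).
    """
    rng = random.Random(seed)
    samples = [sample_fn(rng) for _ in range(n)]
    rewards = [1.0 if verify_fn(s) else 0.0 for s in samples]
    p = sum(rewards) / n  # empirical pass rate among the n samples
    reshaped = [
        r if r > 0 else (-p / (1.0 - p) if p < 1.0 else 0.0)
        for r in rewards
    ]
    best = samples[rewards.index(1.0)] if p > 0 else samples[0]
    return best, reshaped
```

Because only binary outcome rewards are available, the verifier is the sole training signal; the reshaping keeps negative samples from dominating the gradient when correct solutions are rare.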

Use Cases

  • Competitive Mathematics: Ideal for solving problems encountered in mathematical competitions.
  • Advanced Problem Solving: Suitable for applications requiring rigorous logical deduction and multi-step mathematical reasoning.
  • Research in RL for Reasoning: Provides a strong baseline and methodology for further research into reinforcement learning for complex reasoning tasks.