EurusPRM-Stage1: Implicit Process Reward Model
EurusPRM-Stage1 is a 7.6 billion parameter reward model developed by PRIME-RL, distinguished by how it generates process-level rewards. Unlike traditional Process Reward Models (PRMs), which require expensive step-by-step human annotations, EurusPRM-Stage1 uses Implicit PRM to obtain process rewards at no additional annotation cost, relying solely on more readily available response-level labels.
Key Capabilities
- Implicit Process Reward Generation: Derives process rewards by training an Outcome Reward Model (ORM) on response-level labels and calculating log-likelihood ratios during inference, eliminating the need for explicit step-level annotations.
- Mathematical Reasoning Optimization: Demonstrates strong performance in mathematical reasoning tasks, as evidenced by its evaluation across benchmarks like MATH, AMC, AIME 2024, OlympiadBench, and Minerva Math.
- Flexible Integration: The theoretical result underlying implicit PRM is agnostic to the specific ORM training objective, so it can be instantiated with various objectives such as DPO or cross-entropy loss.
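The log-likelihood-ratio mechanism above can be sketched in a few lines. This is a minimal illustration, not the model's actual inference code: the step granularity, the scaling factor `beta`, and the example log-probabilities are all assumptions chosen for clarity. The idea is that the per-step reward is the (scaled) difference in log-likelihood that the trained ORM and a reference model assign to that step.

```python
import math

def implicit_process_rewards(orm_logps, ref_logps, beta=1.0):
    """Per-step implicit process rewards as scaled log-likelihood ratios.

    orm_logps: per-step log-probabilities under the trained ORM
    ref_logps: per-step log-probabilities under the reference model
    Returns one reward per step: beta * (log ORM - log ref).
    """
    assert len(orm_logps) == len(ref_logps)
    return [beta * (o - r) for o, r in zip(orm_logps, ref_logps)]

# Illustrative numbers only: the ORM prefers steps 1-2 and penalizes step 3.
orm = [-0.5, -0.7, -2.0]
ref = [-1.0, -1.0, -1.0]
rewards = implicit_process_rewards(orm, ref)
```

In practice each "step" would aggregate the token-level log-probabilities belonging to that reasoning step, so no step-level labels are ever needed.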
Good for
- Enhancing Mathematical Problem Solving: Improves the performance of base generation models (e.g., Eurus-2-7B-SFT, Llama-3.1-70B-Instruct, Qwen2.5-7B-Instruct) on complex math benchmarks through weighted Best-of-N sampling.
- Cost-Effective Reward Modeling: Ideal for scenarios where annotating detailed step-by-step process labels is impractical or too expensive, offering a method to obtain process rewards from simpler outcome-based labels.
- Research in Reinforcement Learning from Human Feedback (RLHF): Provides a valuable tool for exploring and implementing process-level reinforcement without the typical data annotation burden.
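The weighted Best-of-N selection mentioned above can be sketched as follows. This is a hedged illustration of one common weighting scheme (each candidate votes for its final answer with weight `exp(score)`); the exact aggregation used in the evaluations may differ, and the candidate tuples here are made up for the example.

```python
import math
from collections import defaultdict

def weighted_best_of_n(candidates):
    """candidates: list of (final_answer, reward_score) pairs.

    Each sampled solution votes for its extracted final answer,
    weighted by exp(reward_score); the answer with the largest
    total weight wins.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += math.exp(score)
    return max(totals, key=totals.get)

# Three sampled solutions: two agree on "42" with moderate scores,
# one says "41" with a higher score. The agreeing pair outweighs it.
cands = [("42", 0.9), ("41", 1.5), ("42", 0.8)]
best = weighted_best_of_n(cands)
```

This shows why process-reward-weighted voting can outperform plain majority voting or picking the single highest-scored sample.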