EurusPRM-Stage1: Implicit Process Reward Model
EurusPRM-Stage1 is a 7.6 billion parameter reward model developed by PRIME-RL, distinguished by how it generates process-level rewards. Unlike traditional Process Reward Models (PRMs), which require expensive step-by-step human annotations, EurusPRM-Stage1 uses Implicit PRM to obtain process rewards at no additional annotation cost, relying solely on more readily available response-level labels.
Key Capabilities
- Implicit Process Reward Generation: Derives process rewards by training an Outcome Reward Model (ORM) on response-level labels and calculating log-likelihood ratios during inference, eliminating the need for explicit step-level annotations.
- Mathematical Reasoning Optimization: Demonstrates strong performance in mathematical reasoning tasks, as evidenced by its evaluation across benchmarks like MATH, AMC, AIME 2024, OlympiadBench, and Minerva Math.
- Flexible Integration: The theoretical result underlying implicit PRM is agnostic to the specific ORM training objective, so it can be instantiated with various objectives such as DPO or cross-entropy loss.
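The log-likelihood-ratio mechanism above can be sketched in a few lines. This is a minimal illustration, not the model's actual inference code: the step granularity, the scaling factor `beta`, and the example log-probabilities are all assumptions chosen for clarity. The idea is that the per-step reward is the (scaled) difference in log-likelihood that the trained ORM and a reference model assign to that step.

```python
import math

def implicit_process_rewards(orm_logps, ref_logps, beta=1.0):
    """Per-step implicit process rewards as scaled log-likelihood ratios.

    orm_logps: per-step log-probabilities under the trained ORM
    ref_logps: per-step log-probabilities under the reference model
    Returns one reward per step: beta * (log ORM - log ref).
    """
    assert len(orm_logps) == len(ref_logps)
    return [beta * (o - r) for o, r in zip(orm_logps, ref_logps)]

# Illustrative numbers only: the ORM prefers steps 1-2 and penalizes step 3.
orm = [-0.5, -0.7, -2.0]
ref = [-1.0, -1.0, -1.0]
rewards = implicit_process_rewards(orm, ref)
```

In practice each "step" would aggregate the token-level log-probabilities belonging to that reasoning step, so no step-level labels are ever needed.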
Good for
- Enhancing Mathematical Problem Solving: Improves the performance of base generation models (e.g., Eurus-2-7B-SFT, Llama-3.1-70B-Instruct, Qwen2.5-7B-Instruct) on complex math benchmarks through weighted Best-of-N sampling.
- Cost-Effective Reward Modeling: Ideal for scenarios where annotating detailed step-by-step process labels is impractical or too expensive, offering a method to obtain process rewards from simpler outcome-based labels.
- Research in Reinforcement Learning from Human Feedback (RLHF): Provides a valuable tool for exploring and implementing process-level reinforcement without the typical data annotation burden.
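The weighted Best-of-N selection mentioned above can be sketched as follows. This is a hedged illustration of one common weighting scheme (each candidate votes for its final answer with weight `exp(score)`); the exact aggregation used in the evaluations may differ, and the candidate tuples here are made up for the example.

```python
import math
from collections import defaultdict

def weighted_best_of_n(candidates):
    """candidates: list of (final_answer, reward_score) pairs.

    Each sampled solution votes for its extracted final answer,
    weighted by exp(reward_score); the answer with the largest
    total weight wins.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += math.exp(score)
    return max(totals, key=totals.get)

# Three sampled solutions: two agree on "42" with moderate scores,
# one says "41" with a higher score. The agreeing pair outweighs it.
cands = [("42", 0.9), ("41", 1.5), ("42", 0.8)]
best = weighted_best_of_n(cands)
```

This shows why process-reward-weighted voting can outperform plain majority voting or picking the single highest-scored sample.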