kevinpro/R-PRM-7B-DPO
Text Generation | Concurrency Cost: 1 | Model Size: 7.6B | Quant: FP8 | Ctx Length: 32k | Published: Mar 28, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

The kevinpro/R-PRM-7B-DPO model is a 7.6-billion-parameter language model developed by Shuaijie She et al. for Reasoning-Driven Process Reward Modeling (R-PRM). It is designed to evaluate the step-by-step mathematical reasoning of large language models, and it delivers substantial gains in evaluation accuracy and generalization on benchmarks such as ProcessBench and PRMBench, outperforming strong baselines.


R-PRM: Reasoning-Driven Process Reward Modeling

The kevinpro/R-PRM-7B-DPO model, developed by Shuaijie She et al., introduces a framework for evaluating the mathematical reasoning processes of large language models (LLMs). This 7.6-billion-parameter model implements Reasoning-Driven Process Reward Modeling (R-PRM), which leverages stronger LLMs to generate seed data and then refines the reward model through preference optimization (DPO), avoiding the need for extensive human annotation.
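Below is a minimal sketch of querying the model with Hugging Face transformers to critique a single reasoning step. The prompt wording and the "correct/incorrect" verdict format are illustrative assumptions, not the paper's exact protocol; the model's actual chat template and evaluation prompt should be taken from the official repository.

```python
# Minimal sketch: load R-PRM-7B-DPO with transformers and ask it to judge one
# reasoning step. The prompt text and the expected verdict format are
# assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kevinpro/R-PRM-7B-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

question = (
    "Natalia sold clips to 48 friends in April, and half as many in May. "
    "How many clips did she sell altogether?"
)
step = "Step 1: In May she sold 48 / 2 = 24 clips."

# Hypothetical evaluation prompt: have the PRM reason about the step, then
# emit a final verdict that can be parsed downstream.
prompt = (
    f"Question: {question}\n{step}\n"
    "Analyze this step, then answer 'correct' or 'incorrect'."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
critique = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
print(critique)  # free-form analysis ending in a verdict (assumed format)
```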

Key Capabilities and Differentiators

  • Enhanced Mathematical Reasoning Evaluation: R-PRM significantly improves LLMs' ability to evaluate mathematical reasoning step-by-step, offering comprehensive, transparent, and robust assessments.
  • Superior Accuracy and Generalization: The model achieves substantial gains in evaluation accuracy and generalization, outperforming strong baselines on ProcessBench and PRMBench. For instance, R-PRM-7B-DPO achieves 70.4 Avg. F1 on ProcessBench, an improvement of +13.9 F1 over Qwen2.5-Math-7B-PRM800K.
  • Data Efficiency: R-PRM is highly data-efficient, matching the performance of models trained on much larger datasets while using only a fraction of the training samples.
  • Improved Policy Guidance: When used to guide policy models, R-PRM consistently enhances reasoning performance across diverse datasets, achieving state-of-the-art (SOTA) results.
  • Effective Best-of-N and Guided Search Strategies: The model improves accuracy by +8.6 points over the Pass@1 baseline in Best-of-N selection and by +8.4 points when guiding step-level search over reasoning paths; a minimal Best-of-N sketch follows this list.
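To make the Best-of-N setting concrete, here is a small, self-contained sketch of PRM-based candidate selection. The score_step stub stands in for a real call to R-PRM-7B-DPO, and aggregating step scores by their minimum is one common convention rather than the paper's confirmed scoring rule.

```python
# Minimal sketch of Best-of-N selection with a process reward model (PRM).
# `score_step` is a hypothetical stand-in for querying R-PRM-7B-DPO on one
# step; min-aggregation over steps is one common convention, not necessarily
# the paper's exact choice.
from typing import Callable, List

def best_of_n(
    question: str,
    candidates: List[List[str]],  # N candidate solutions, each a list of steps
    score_step: Callable[[str, List[str], str], float],  # -> P(step is correct)
) -> int:
    """Return the index of the candidate whose weakest step scores highest."""
    best_idx, best_score = 0, float("-inf")
    for i, steps in enumerate(candidates):
        # Score each step conditioned on the question and the preceding steps.
        step_scores = [score_step(question, steps[:j], s) for j, s in enumerate(steps)]
        agg = min(step_scores)  # a solution is only as strong as its weakest step
        if agg > best_score:
            best_idx, best_score = i, agg
    return best_idx

# Toy usage with a dummy scorer; a real scorer would parse R-PRM's verdict
# (or its verdict-token probability) for each step.
cands = [
    ["48 / 2 = 24 clips in May.", "48 + 24 = 72 clips in total."],
    ["48 * 2 = 96 clips in May.", "48 + 96 = 144 clips in total."],
]
dummy = lambda q, prefix, step: 0.9 if "24" in step or "72" in step else 0.2
print(best_of_n("How many clips did Natalia sell altogether?", cands, dummy))  # -> 0
```

The same step scorer can drive guided search: expand partial reasoning paths one step at a time and keep only the highest-scoring prefixes.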

Ideal Use Cases

  • Evaluating LLM Mathematical Reasoning: Perfect for developers and researchers needing precise, step-by-step evaluation of mathematical problem-solving by LLMs.
  • Improving LLM Performance in Math: Can be used to guide and enhance the reasoning capabilities of other LLMs in mathematical tasks.
  • Research in Reward Modeling: Offers a scalable and data-efficient solution for process-level reward modeling, especially where human annotations are scarce.