GenPRM-1.5B: A Generative Process Reward Model
GenPRM-1.5B is a 1.5-billion-parameter generative process reward model built on the DeepSeek-R1-Distill series. It is engineered to perform explicit Chain-of-Thought (CoT) reasoning and code verification before judging each step, improving the reliability of its process judgments.
Key Capabilities and Innovations
- Explicit CoT Reasoning & Code Verification: GenPRM reasons step by step and verifies generated code before emitting each process judgment, making its verdicts more accurate and easier to inspect.
- Relative Progress Estimation (RPE): The model leverages RPE to improve Monte Carlo estimation and produce more reliable hard label assignments for individual reasoning steps.
- Test-Time Scaling: GenPRM supports parallel test-time scaling through majority voting, allowing it to function effectively as a verifier or critic for policy models.
- State-of-the-Art Performance: The GenPRM family, including this 1.5B variant, has demonstrated strong performance across multiple benchmarks. As a verifier, it can surpass larger classification-based PRMs like Qwen2.5-Math-PRM-72B via test-time scaling. As a critic, it shows superior critique capabilities, achieving substantial performance gains after refinement iterations.
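One plausible reading of the RPE bullet above, sketched under the assumption that a step's hard label is derived from the change in Monte Carlo success estimates before and after that step. The function names `mc_success_rate` and `rpe_hard_label` and the `threshold` cutoff are illustrative, not taken from the model card or paper:

```python
def mc_success_rate(rollout_fn, n_rollouts=16):
    """Estimate P(final answer is correct) from a partial solution by
    sampling n_rollouts completions; rollout_fn returns 1 on success, 0
    on failure."""
    return sum(rollout_fn() for _ in range(n_rollouts)) / n_rollouts

def rpe_hard_label(rate_before, rate_after, threshold=0.0):
    """Assign a hard step label from relative progress: the step is
    labeled correct (1) when it raises the estimated success probability
    by more than `threshold` (a hypothetical cutoff), else incorrect (0)."""
    return 1 if rate_after - rate_before > threshold else 0

# A step that lifts the Monte Carlo estimate from 0.3 to 0.6 is kept
# as a positive example; one that lowers it is labeled negative.
print(rpe_hard_label(0.3, 0.6))  # prints 1
print(rpe_hard_label(0.5, 0.4))  # prints 0
```

Comparing estimates before and after a step, rather than thresholding a single absolute estimate, is what makes the label "relative": it credits the step itself instead of the difficulty of the problem.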
Use Cases and Strengths
GenPRM-1.5B is particularly well-suited for applications requiring robust mathematical reasoning, code verification, and detailed critique. Its ability to act as a verifier or critic makes it valuable for improving the performance of other language models, especially in complex problem-solving domains. The model was fine-tuned on 23K SFT examples from GenPRM-MATH-Data, focused on mathematical tasks. Further technical details are available in the associated paper.
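As a rough illustration of the parallel test-time scaling described above, the sketch below aggregates several independently sampled process judgments for the same step by majority vote. The "correct"/"incorrect" verdict strings and tie-breaking rule are assumptions for illustration, not the model's actual output format:

```python
from collections import Counter

def majority_vote(judgments):
    """Combine N sampled process judgments into a single verdict.

    Each element is the verdict parsed from one sampled CoT rationale;
    ties resolve to "incorrect" as a conservative default."""
    counts = Counter(judgments)
    return "correct" if counts["correct"] > counts["incorrect"] else "incorrect"

# Five rationales sampled in parallel for the same reasoning step:
samples = ["correct", "correct", "incorrect", "correct", "incorrect"]
print(majority_vote(samples))  # prints "correct"
```

Because each sampled rationale carries its own CoT and code verification, sampling more rationales and voting trades extra inference compute for more stable judgments, which is how a 1.5B generative PRM can compete with much larger classification-based PRMs.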