GenPRM-7B: Generative Process Reward Model
GenPRM-7B is a 7.6-billion-parameter generative process reward model (PRM) that introduces several innovations for reasoning and verification. Rather than emitting a score directly, it performs explicit Chain-of-Thought (CoT) reasoning and code verification before making each process judgment, and it improves on standard Monte Carlo estimation and hard labeling through Relative Progress Estimation (RPE).
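To make the generate-then-judge flow concrete, here is a minimal sketch of how a generative PRM might evaluate one solution step: produce a CoT analysis, write and run verification code, and only then emit a verdict. The function names, prompts, and the stubbed `generate`/`run_code` callables are all hypothetical stand-ins, not the model's actual interface.

```python
def judge_step(step, generate, run_code):
    """Sketch of a generative PRM judgment for one solution step:
    the model writes a CoT analysis, writes verification code, runs it,
    and only then emits a yes/no verdict. `generate` and `run_code` are
    hypothetical stand-ins for the model and a sandboxed interpreter."""
    analysis = generate(f"Analyze this step: {step}")
    code = generate(f"Write code that checks: {step}")
    result = run_code(code)
    prompt = f"Analysis: {analysis}\nCode result: {result}\nIs the step correct (yes/no)?"
    return generate(prompt)

# Stubbed model and interpreter so the sketch runs end to end.
generate = lambda prompt: "yes"
run_code = lambda code: True
verdict = judge_step("2 + 2 = 4", generate, run_code)  # -> "yes"
```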
Key Capabilities
- State-of-the-Art Verification: As a verifier, GenPRM-7B outperforms classification-based PRMs of comparable size and, with test-time scaling, even surpasses larger models such as Qwen2.5-Math-PRM-72B.
- Superior Critique: As a critic, the model delivers 3.4x greater performance improvement than DeepSeek-R1-Distill-Qwen-7B after three refinement iterations.
- Test-Time Scaling: Supports parallel test-time scaling with majority voting for GenPRM itself, and acts as a verifier or critic for policy models.
- Mathematical Reasoning: Trained on 23K SFT examples, including the GenPRM-MATH-Data dataset, with DeepSeek-R1-Distill series models as the base, making it particularly adept at mathematical problem-solving and critique.
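As an illustration of the parallel test-time scaling mentioned above, the sketch below aggregates per-step verdicts from N independent GenPRM samples by majority vote, then scores a full solution as the minimum step reward (product or mean are common alternatives). The sample verdicts are fabricated for illustration; in practice each would come from a separate GenPRM generation.

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate N independent per-step verdicts ('+' correct, '-' incorrect)
    sampled in parallel from a generative PRM into one verdict per step."""
    return [Counter(step).most_common(1)[0][0] for step in zip(*judgments)]

def solution_score(step_rewards):
    """Collapse per-step rewards into one solution-level score
    (minimum over steps, a common PRM aggregation)."""
    return min(step_rewards)

# Hypothetical example: 3 GenPRM samples judging a 4-step solution.
samples = [
    ['+', '+', '-', '+'],
    ['+', '+', '+', '+'],
    ['+', '-', '-', '+'],
]
verdicts = majority_vote(samples)  # -> ['+', '+', '-', '+']
```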
Good For
- Automated Code Verification: Leveraging its explicit code verification capabilities.
- Process Supervision: Providing detailed, step-by-step feedback and judgment on reasoning processes.
- Improving LLM Outputs: Acting as a critic to refine and enhance the quality of other language models' generated content, especially in complex reasoning tasks.
- Mathematical Problem Solving: Excelling in tasks requiring detailed mathematical reasoning and solution critique.
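To make the verifier role concrete, here is a minimal Best-of-N reranking sketch: a policy model proposes N candidate solutions, the PRM assigns each step a reward, and the candidate with the highest aggregated score is selected. The `rewards` values are stubbed placeholders; in a real setup they would be derived from GenPRM-7B's judgments.

```python
def aggregate(step_rewards):
    """Collapse per-step rewards into one solution-level score
    (minimum over steps; product or mean are common alternatives)."""
    return min(step_rewards)

def best_of_n(candidates, step_rewards_per_candidate):
    """Return the candidate whose PRM step rewards aggregate highest."""
    best = max(range(len(candidates)),
               key=lambda i: aggregate(step_rewards_per_candidate[i]))
    return candidates[best]

# Hypothetical stubbed step rewards for three candidate solutions.
candidates = ["solution A", "solution B", "solution C"]
rewards = [[0.9, 0.4, 0.8], [0.95, 0.9, 0.85], [0.7, 0.6, 0.9]]
chosen = best_of_n(candidates, rewards)  # -> "solution B"
```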