skylenage-ai/GPRM-4B
skylenage-ai/GPRM-4B is a 4-billion-parameter Global Perspective Process Reward Model (GPRM) designed to improve reasoning verification in long-chain tasks by overcoming the local-context limitation of traditional PRMs. Developed by skylenage-ai, it combines history-aware evaluation and future-informed reasoning within a 4-D diagnostic framework. The model is built for error localization and step validation, which makes it well suited to complex reasoning and mathematical problem solving.
Overview
skylenage-ai/GPRM-4B is a 4-billion-parameter Global Perspective Process Reward Model (GPRM) for verifying complex, multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate each step in isolation; GPRM instead takes a "Global Perspective", conditioning on earlier steps and their evaluations and anticipating how the current step affects later reasoning.
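To illustrate the global perspective described above, here is a minimal Python sketch of how an evaluation loop could assemble the context for each step: earlier steps with their verdicts (history), the step being judged (current), and all later steps (future). The prompt layout, `StepJudgment`, `build_global_context`, `evaluate_chain`, and the arithmetic-checking `toy_judge` are hypothetical illustrations, not GPRM-4B's actual prompt format or API; a real setup would replace `toy_judge` with a call to the model.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepJudgment:
    """A step that has already been evaluated, together with its verdict."""
    index: int
    text: str
    verdict: str  # hypothetical labels: "correct" / "incorrect"


def build_global_context(problem: str, steps: list[str], k: int,
                         history: list[StepJudgment]) -> str:
    """Assemble a hypothetical evaluation prompt for step k (0-based).

    History: earlier steps plus the verdicts already assigned to them.
    Current: the step under evaluation.
    Future:  every later step, so the judge can look ahead.
    """
    lines = [f"Problem: {problem}", "History:"]
    for h in history:
        lines.append(f"  Step {h.index}: {h.text} [judged: {h.verdict}]")
    lines.append(f"Current step {k}: {steps[k]}")
    lines.append("Future:")
    for j in range(k + 1, len(steps)):
        lines.append(f"  Step {j}: {steps[j]}")
    return "\n".join(lines)


def evaluate_chain(problem: str, steps: list[str],
                   judge: Callable[[str], str]) -> list[str]:
    """Judge steps in order, feeding each verdict back in as history."""
    history: list[StepJudgment] = []
    for k in range(len(steps)):
        verdict = judge(build_global_context(problem, steps, k, history))
        history.append(StepJudgment(k, steps[k], verdict))
    return [h.verdict for h in history]


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def toy_judge(prompt: str) -> str:
    """Stand-in for the model: only checks the arithmetic of the current step."""
    current = next(l for l in prompt.splitlines() if l.startswith("Current step"))
    a, op, b, _, c = current.split(": ", 1)[1].split()  # e.g. "4 + 1 = 6"
    return "correct" if OPS[op](int(a), int(b)) == int(c) else "incorrect"


steps = ["2 + 2 = 4", "4 + 1 = 6", "6 * 2 = 12"]
print(evaluate_chain("Compute (2 + 2 + 1) * 2.", steps, toy_judge))
# ['correct', 'incorrect', 'correct']
```

The toy judge ignores the history and future blocks and so accepts step 2, whose arithmetic is valid in isolation. A global-perspective model is meant to use that extra context, for example to notice that step 2 builds on the faulty result of step 1.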
Key Capabilities
- History-Aware Evaluation: Explicitly conditions on previous steps and their judgments to maintain context.
- Future-Informed Reasoning: Incorporates a look-ahead mechanism to validate steps against subsequent derivations.
- 4-D Diagnostic Framework: Utilizes a structured evaluation across Look-back, Look-ahead, Self-check, and Goal alignment for robust error localization.
- Superior Benchmark Performance: Achieves an overall score of 73.9 on PRMBench and an average F1 score of 74.3 on ProcessBench, outperforming larger models such as GPT-4o and o1-mini on specific metrics.
- Enhanced Agent Error Detection: Identifies errors in agent-based tasks more accurately, reaching 47.0% average accuracy on Agent Error Bench.
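The sketch below shows, under stated assumptions, how per-step 4-D diagnoses could be reduced to a single error location and how predicted locations are scored with a ProcessBench-style F1. The `Diagnosis` record, our reading of each dimension, and the rule that a step passes only when all four dimensions pass are hypothetical. The `-1` label for a fully correct solution and the F1 definition (harmonic mean of accuracy on erroneous samples and accuracy on correct samples) follow ProcessBench's scoring as we understand it.

```python
from dataclasses import astuple, dataclass


@dataclass
class Diagnosis:
    """Hypothetical per-step result: one pass/fail flag per dimension.

    Field meanings are our interpretation of the four dimension names.
    """
    look_back: bool       # consistent with earlier steps?
    look_ahead: bool      # not contradicted by later derivations?
    self_check: bool      # internally valid on its own?
    goal_alignment: bool  # moves toward the stated goal?

    def passes(self) -> bool:
        # Assumption: a step is correct only if all four dimensions pass.
        return all(astuple(self))


def first_error_index(diagnoses: list[Diagnosis]) -> int:
    """0-based index of the earliest failing step, or -1 if every step passes."""
    for i, d in enumerate(diagnoses):
        if not d.passes():
            return i
    return -1


def processbench_f1(predicted: list[int], gold: list[int]) -> float:
    """Harmonic mean of accuracy on erroneous (gold != -1) and correct (gold == -1) samples."""
    err = [p == g for p, g in zip(predicted, gold) if g != -1]
    cor = [p == g for p, g in zip(predicted, gold) if g == -1]
    acc_err = sum(err) / len(err) if err else 0.0
    acc_cor = sum(cor) / len(cor) if cor else 0.0
    if acc_err + acc_cor == 0:
        return 0.0
    return 2 * acc_err * acc_cor / (acc_err + acc_cor)


ok = Diagnosis(True, True, True, True)
bad = Diagnosis(True, False, True, True)  # fails look-ahead only
print(first_error_index([ok, bad, ok]))  # 1
print(round(processbench_f1([1, -1, 2, -1], [1, -1, 0, -1]), 3))  # 0.667
```

In the usage example, one of the two erroneous samples is localized correctly (accuracy 0.5) and both correct samples are recognized (accuracy 1.0), giving an F1 of 2 · 0.5 · 1.0 / 1.5 ≈ 0.667. The harmonic mean penalizes a verifier that simply labels everything correct, or everything wrong.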
Training Strategy
GPRM-4B was developed using a two-stage progressive training pipeline:
- Structured SFT: Learned 4-dimensional diagnostic reasoning via targeted error injection, using Qwen3-235B-Instruct for annotation.
- GRPO Optimization: Refined its evaluation policy using Group Relative Policy Optimization on hard-mined samples from PRM800K, incorporating complete global context (History + Current + Future).
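The core GRPO ingredient can be sketched in a few lines of Python: each rollout's reward is normalized against the other rollouts sampled for the same input, so no learned value function is needed. The exact-match `localization_reward` and the `is_hard` filter are assumptions about what the reward and "hard-mined" selection could look like, not the authors' documented recipe. The population standard deviation is used here; implementations differ on this detail.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: (r_i - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group
    return [(r - mean) / (std + eps) for r in rewards]


def localization_reward(predicted: int, gold: int) -> float:
    """Hypothetical reward: 1 if the predicted first-error index matches the label."""
    return 1.0 if predicted == gold else 0.0


def is_hard(predictions: list[int], gold: int, max_pass_rate: float = 0.5) -> bool:
    """Hypothetical hard-mining rule on a group of rollouts for one sample.

    Keep the sample if the model is right in some, but at most `max_pass_rate`,
    of its rollouts. An all-wrong group has zero reward variance, so every
    advantage would be 0 and the sample would carry no learning signal.
    """
    rate = sum(p == gold for p in predictions) / len(predictions)
    return 0.0 < rate <= max_pass_rate


gold = 2
rollouts = [2, 3, -1, 2]  # predicted first-error indices from 4 sampled evaluations
rewards = [localization_reward(p, gold) for p in rollouts]
print(is_hard(rollouts, gold))  # True (pass rate 0.5)
print([round(a, 3) for a in group_relative_advantages(rewards)])  # [1.0, -1.0, -1.0, 1.0]
```

Rollouts that beat their group average get positive advantages and the rest get negative ones, which is why a sample the model already solves every time (or never) contributes nothing to the update.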
Good for
- Complex Reasoning Tasks: Ideal for applications requiring meticulous step-by-step verification and error localization in long reasoning chains.
- Mathematical Problem Solving: Performs strongly on benchmarks such as ProcessBench (GSM8K, MATH, OlympiadBench) and in downstream test-time search on mathematical problems.
- Agent Error Detection: Useful for improving the reliability and debugging capabilities of AI agents by identifying and correcting errors in their operational steps.
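For downstream test-time search, a process reward model is often used to rerank several sampled solutions (best-of-N). This sketch assumes a hypothetical `score_steps` callable that wraps the reward model and returns a per-step correctness score in [0, 1]; `toy_score_steps` stands in for it here. Aggregating a chain by its minimum step score is one common choice, not necessarily the one used in the reported experiments.

```python
from typing import Callable, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores by the weakest step (one common PRM choice)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(candidates: Sequence[list[str]],
              score_steps: Callable[[list[str]], list[float]]) -> int:
    """Index of the candidate whose aggregated step score is highest."""
    scores = [solution_score(score_steps(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_score_steps(steps: list[str]) -> list[float]:
    # Hypothetical stand-in for the reward model: low score for any step marked wrong.
    return [0.1 if "(wrong)" in s else 0.9 for s in steps]


candidates = [
    ["x = 3", "x^2 = 6 (wrong)"],
    ["x = 3", "x^2 = 9"],
    ["x = 4 (wrong)", "x^2 = 16"],
]
print(best_of_n(candidates, toy_score_steps))  # 1
```

Min-aggregation means a single bad step sinks an otherwise plausible chain, which matches the goal of step-level verification; product or last-step aggregation are alternatives with different trade-offs.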