launch/ThinkPRM-7B
ThinkPRM-7B by launch is a 7.6-billion-parameter generative Process Reward Model (PRM) based on the R1-Distill-Qwen-7B architecture, with a 32,768-token context length. It is fine-tuned for step-by-step verification of reasoning processes, generating an explicit verification chain-of-thought (CoT) that labels each step. The model is highly data-efficient, requiring far less supervision data than traditional discriminative PRMs, and excels at scoring and critiquing solutions in mathematical reasoning, scientific QA, and code generation tasks.
ThinkPRM-7B: Process Reward Model for Step-by-Step Verification
ThinkPRM-7B is a 7.6 billion parameter generative Process Reward Model (PRM) developed by launch, built upon the R1-Distill-Qwen-7B architecture. Its core innovation lies in its ability to perform step-by-step verification of reasoning processes, such as mathematical solutions, by generating an explicit verification chain-of-thought (CoT) that labels each step.
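To make the verification setup concrete, here is a minimal sketch of how a problem and a candidate solution prefix might be assembled into a single prompt for the verifier. The exact template (wording, step numbering, the "Verification:" cue) is an assumption for illustration, not the format the model was fine-tuned on; consult the model's repository for the canonical prompt.

```python
# Hypothetical prompt builder for a generative PRM such as ThinkPRM-7B.
# The template below is an illustrative assumption, not the official format.

def build_verification_prompt(problem: str, steps: list[str]) -> str:
    """Assemble a problem and a candidate solution prefix into one prompt
    that asks the verifier to judge each step in turn."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        "You are given a math problem and a proposed step-by-step solution.\n"
        "Review each step and judge whether it is correct.\n\n"
        f"Problem: {problem}\n\n"
        f"Solution:\n{numbered}\n\n"
        "Verification:"
    )

prompt = build_verification_prompt(
    "What is 12 * 13?",
    ["12 * 13 = 12 * 10 + 12 * 3", "= 120 + 36", "= 156"],
)
print(prompt)
```

The prompt string would then be passed to the model (e.g. via `transformers` text generation) to produce the verification CoT.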
Key Capabilities
- Step-Level Verification: Provides natural language critiques and correctness judgments for individual steps within a solution prefix.
- Data Efficiency: Achieves strong performance with significantly less supervision data (1K synthetic examples) compared to traditional discriminative PRMs.
- Interpretability: Uses a standard language modeling objective, making its verification process transparent.
- Performance: Outperforms LLM-as-a-judge and discriminative PRM baselines trained on roughly 100x more labels, on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
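Because the verifier emits its judgments in natural language, step labels must be parsed out of the generated CoT before they can be used as scores. The sketch below assumes a `\boxed{correct}` / `\boxed{incorrect}` label convention and a simple fraction-correct aggregation; both are illustrative assumptions about the output format, not a documented interface.

```python
import re

# Illustrative parser for a generated verification chain-of-thought.
# The "\boxed{correct}/\boxed{incorrect}" convention is an assumption.

def parse_step_labels(verification_cot: str) -> list[bool]:
    """Return one boolean per step, in order of appearance in the CoT."""
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [label == "correct" for label in labels]

def solution_score(step_labels: list[bool]) -> float:
    """Fraction of steps judged correct; other PRM setups aggregate by the
    probability of the first error or the minimum step score instead."""
    if not step_labels:
        return 0.0
    return sum(step_labels) / len(step_labels)

cot = (
    "Step 1 correctly expands the product, so it is \\boxed{correct}. "
    "Step 2 adds 120 + 36 = 156, \\boxed{correct}. "
    "Step 3 restates the result, \\boxed{correct}."
)
print(parse_step_labels(cot))                  # [True, True, True]
print(solution_score(parse_step_labels(cot)))  # 1.0
```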
Good For
- Scoring Solutions: Assigning step-level or overall scores to candidate solutions, useful for Best-of-N sampling or guiding tree search in reasoning tasks.
- Generating Verification Rationales: Producing detailed CoTs that explain why a step is correct or incorrect, enhancing interpretability.
- Standalone Verification: Evaluating the correctness of problem-solution pairs across domains like mathematical reasoning, scientific question answering, and code generation.
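The Best-of-N use case above can be sketched in a few lines: score each candidate solution with the verifier, aggregate its step scores, and keep the highest-scoring candidate. The max-min aggregation (a solution is only as strong as its weakest step) is one common choice, not necessarily the one used in the ThinkPRM experiments; the scores here are placeholders for what the verifier would assign.

```python
# Minimal Best-of-N selection sketch. Step scores are placeholders for
# verifier outputs; max-min aggregation is one illustrative choice.

def best_of_n(candidates: list[str], step_scores: list[list[float]]) -> str:
    """Select the candidate whose weakest step is strongest (max over
    candidates of the minimum step score)."""
    def aggregate(scores: list[float]) -> float:
        return min(scores) if scores else 0.0
    best = max(range(len(candidates)), key=lambda i: aggregate(step_scores[i]))
    return candidates[best]

candidates = ["solution A", "solution B", "solution C"]
step_scores = [[0.9, 0.2, 0.8], [0.7, 0.6, 0.9], [0.95, 0.1, 0.5]]
print(best_of_n(candidates, step_scores))  # solution B
```

The same aggregate-and-argmax pattern extends to guiding tree search, where partial solutions are scored and the most promising branch is expanded.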
Limitations
- May exhibit overconfidence, with scores clustered near 0 or 1.
- Step label interference can occur, where early incorrect judgments might bias subsequent evaluations.
- Performance can be sensitive to input formatting and prompting.