launch/ThinkPRM-7B

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Context Length: 32k · Published: May 17, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

ThinkPRM-7B by launch is a 7.6-billion-parameter generative Process Reward Model (PRM) based on the R1-Distill-Qwen-7B architecture, with a 32,768-token context length. It is fine-tuned to verify reasoning processes step by step, generating an explicit verification chain-of-thought (CoT) that labels each step. The model is highly data-efficient, requiring significantly less supervision data than traditional discriminative PRMs, and excels at scoring and critiquing solutions in mathematical reasoning, scientific QA, and code generation tasks.


ThinkPRM-7B: Process Reward Model for Step-by-Step Verification

ThinkPRM-7B is a 7.6 billion parameter generative Process Reward Model (PRM) developed by launch, built upon the R1-Distill-Qwen-7B architecture. Its core innovation lies in its ability to perform step-by-step verification of reasoning processes, such as mathematical solutions, by generating an explicit verification chain-of-thought (CoT) that labels each step.
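In practice, using a generative PRM like this means formatting the problem and candidate solution into a verification prompt, then parsing the per-step judgments out of the generated CoT. The sketch below illustrates that flow; the prompt template and the `\boxed{correct}` / `\boxed{incorrect}` label convention are assumptions for illustration, not the model's documented chat format.

```python
import re


def build_verification_prompt(problem: str, steps: list[str]) -> str:
    """Format a problem and a candidate solution for step-by-step
    verification. NOTE: this template is a hypothetical sketch, not
    the model's official prompt format."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        f"Problem: {problem}\n\n"
        f"Solution:\n{numbered}\n\n"
        "Verify each step. After critiquing a step, end its judgment with "
        "\\boxed{correct} or \\boxed{incorrect}.\n"
    )


def parse_step_labels(verification_cot: str) -> list[bool]:
    """Extract per-step correct/incorrect judgments from the generated
    verification chain-of-thought (True = step judged correct)."""
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [label == "correct" for label in labels]
```

Because the verifier is a standard causal LM, the same prompt can be sent through any text-generation stack; only the parsing of its output is PRM-specific.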

Key Capabilities

  • Step-Level Verification: Provides natural language critiques and correctness judgments for individual steps within a solution prefix.
  • Data Efficiency: Achieves strong performance with significantly less supervision data (1K synthetic examples) compared to traditional discriminative PRMs.
  • Interpretability: Uses a standard language modeling objective, making its verification process transparent.
  • Performance: Outperforms LLM-as-a-judge and discriminative PRM baselines (the latter trained on roughly 100x more labels) on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
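Step-level judgments are typically collapsed into a single solution score before use. A minimal sketch of one common PRM aggregation is shown below; taking the minimum (the weakest step dominates) is one conventional choice, with product or mean as alternatives, and none of these is claimed here as the model's official scoring rule.

```python
def solution_score(step_probs: list[float]) -> float:
    """Aggregate per-step correctness probabilities into one solution
    score. min() lets the weakest step dominate; this is a common PRM
    convention, assumed here for illustration."""
    return min(step_probs) if step_probs else 0.0
```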

Good For

  • Scoring Solutions: Assigning step-level or overall scores to candidate solutions, useful for Best-of-N sampling or guiding tree search in reasoning tasks.
  • Generating Verification Rationales: Producing detailed CoTs that explain why a step is correct or incorrect, enhancing interpretability.
  • Standalone Verification: Evaluating the correctness of problem-solution pairs across domains like mathematical reasoning, scientific question answering, and code generation.
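For the Best-of-N use case above, the verifier's solution scores simply rank candidate generations. A minimal sketch (assuming scores have already been computed, e.g. via a step-score aggregation):

```python
def best_of_n(candidates: list[str], scores: list[float]) -> str:
    """Return the candidate solution with the highest verifier score.
    Assumes candidates and scores are parallel, non-empty lists."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be parallel, non-empty")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]
```

The same ranking signal can score partial prefixes to guide tree search instead of reranking complete solutions.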

Limitations

  • May exhibit overconfidence, with scores clustered near 0 or 1.
  • Step label interference can occur, where early incorrect judgments might bias subsequent evaluations.
  • Performance can be sensitive to input formatting and prompting.
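One simple way to soften the overconfidence noted above is to sample several independent verification CoTs and average their per-step judgments, yielding scores between 0 and 1 rather than hard 0/1 labels. This mitigation is a generic self-consistency sketch, not a documented feature of the model:

```python
from statistics import mean


def consensus_step_scores(sampled_labels: list[list[bool]]) -> list[float]:
    """Average per-step correct/incorrect judgments across K
    independently sampled verification CoTs. Runs may disagree on
    length; we only score steps all runs covered."""
    n_steps = min(len(run) for run in sampled_labels)
    return [mean(run[i] for run in sampled_labels) for i in range(n_steps)]
```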