launch/ThinkPRM-1.5B

Status: Warm · Public
Parameters: 1.5B
Precision: BF16
Context length: 32,768 tokens
Released: Apr 25, 2025
License: apache-2.0
Source: Hugging Face

ThinkPRM-1.5B by launch is a 1.5-billion-parameter Process Reward Model (PRM) built on the R1-Distill-Qwen-1.5B architecture and designed for step-by-step verification of reasoning processes. It generates an explicit verification chain-of-thought (CoT) that labels each step, and it requires significantly less supervision data than traditional discriminative PRMs. The model provides step-level verification scores and critiques for solutions in mathematical reasoning, scientific QA, and code generation, with a notable context length of 131,072 tokens.

Overview

ThinkPRM-1.5B: Process Reward Model for Step-by-Step Verification

ThinkPRM-1.5B is a 1.5 billion parameter Process Reward Model (PRM) developed by launch, built upon the R1-Distill-Qwen-1.5B architecture. Its core function is to perform step-by-step verification of reasoning processes, such as mathematical solutions, by generating an explicit chain-of-thought (CoT) that critiques and labels each step.

Key Capabilities

  • Data-Efficient Verification: Achieves strong performance with significantly less supervision data (1K synthetic examples) compared to traditional discriminative PRMs.
  • Generative Critiques: Provides natural language critiques and correctness judgments for each step in a solution prefix, enhancing interpretability.
  • Superior Performance: Outperforms LLM-as-a-judge and discriminative PRM baselines (the latter trained on ~100x more labels) across benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
  • Scalable Process Verification: Uses a standard language modeling objective, allowing for the generation of longer or multiple verification CoTs.
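To make the "generative critiques" idea concrete, here is a minimal sketch of how a verification CoT might be reduced to per-step labels and a scalar score. The output format (one `\boxed{correct}` or `\boxed{incorrect}` judgment per step) is an assumption for illustration, not the model's documented format:

```python
import re

def parse_verification_cot(cot: str) -> list[bool]:
    """Extract per-step judgments from a verification CoT.

    Hypothetical format: the verifier ends each step critique with
    '\\boxed{correct}' or '\\boxed{incorrect}'.
    """
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", cot)
    return [label == "correct" for label in labels]

def solution_score(step_labels: list[bool]) -> float:
    """Collapse step labels to a scalar: a solution prefix is only as
    good as its first error, so score 1.0 only if every step passes."""
    return 1.0 if all(step_labels) else 0.0

cot = (
    "Step 1: The substitution is valid. \\boxed{correct}\n"
    "Step 2: A sign was dropped here. \\boxed{incorrect}\n"
)
print(parse_verification_cot(cot))  # [True, False]
print(solution_score(parse_verification_cot(cot)))  # 0.0
```

In practice the scalar could also be a product of per-step probabilities taken from the verifier's token logits rather than a hard all-or-nothing score.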

Good for

  • Scoring Solutions: Assigning step-level or overall scores to candidate solutions for ranking in Best-of-N sampling or guiding tree search.
  • Generating Verification Rationales: Producing detailed explanations of why a particular step is correct or incorrect.
  • Standalone Verification: Evaluating the correctness of problem-solution pairs in domains like mathematical reasoning, scientific QA, and code generation.
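The Best-of-N use case above can be sketched in a few lines. The `verifier_score` callable here is a hypothetical stand-in for running ThinkPRM on each (problem, solution) pair and reducing its verification CoT to a scalar; the toy scorer exists only to make the example self-contained:

```python
def best_of_n(problem: str, candidates: list[str], verifier_score) -> str:
    """Rank N candidate solutions by verifier score and return the best.

    `verifier_score(problem, solution) -> float` stands in for a real
    ThinkPRM call (generate a verification CoT, reduce it to a score).
    """
    return max(candidates, key=lambda sol: verifier_score(problem, sol))

# Toy stand-in scorer: prefer candidates whose stated answer is "42".
toy_score = lambda problem, sol: 1.0 if sol.endswith("42") else 0.0

candidates = ["x = 41", "x = 42", "x = 40"]
print(best_of_n("Solve for x.", candidates, toy_score))  # x = 42
```

The same scoring function can also guide tree search: instead of ranking complete solutions, score partial solution prefixes and expand only the highest-scoring branches.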