launch/ThinkPRM-14B

Text Generation · Concurrency Cost: 1 · Model Size: 14.8B · Quant: FP8 · Context Length: 32k · Published: Apr 25, 2025 · License: apache-2.0 · Architecture: Transformer

ThinkPRM-14B is a 14.8 billion parameter generative Process Reward Model (PRM) developed by launch, based on the R1-Distill-Qwen-14B architecture. It is fine-tuned to perform step-by-step verification of reasoning processes by generating explicit verification chain-of-thought (CoT) with step-level labeling. This model is highly data-efficient, requiring significantly less supervision data than traditional discriminative PRMs while achieving strong performance. It excels at scoring solutions, generating detailed verification rationales, and evaluating problem-solution pairs across mathematical reasoning, scientific QA, and code generation tasks.


ThinkPRM-14B: Generative Process Reward Model

ThinkPRM-14B is a 14.8 billion parameter generative Process Reward Model (PRM) built upon the R1-Distill-Qwen-14B architecture. Its core function is to provide step-level verification scores and critiques for reasoning processes, such as mathematical solutions, by generating an explicit chain-of-thought (CoT) that labels each step as correct or incorrect.
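As a rough illustration, the sketch below queries the model through Hugging Face transformers and parses per-step judgments out of the generated verification CoT. The prompt template and the \boxed{correct/incorrect} label format are assumptions for illustration only; consult the official model card for the exact verification template.

```python
# Minimal sketch: step-by-step verification with ThinkPRM-14B via Hugging Face
# transformers. The prompt wording and label format are illustrative
# assumptions, not the official template.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "launch/ThinkPRM-14B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

problem = "What is 12 * 7 + 3?"
solution = "Step 1: 12 * 7 = 84.\nStep 2: 84 + 3 = 87. The answer is 87."

# Hypothetical verification prompt: ask the verifier to critique each step
# and close each critique with a boxed correct/incorrect judgment.
prompt = (
    "You are given a problem and a proposed step-by-step solution.\n"
    f"Problem: {problem}\n"
    f"Solution:\n{solution}\n\n"
    "Review each step and end your critique of each step with "
    "\\boxed{correct} or \\boxed{incorrect}.\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
verification_cot = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Extract per-step labels from the generated verification chain-of-thought.
step_labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
print(verification_cot)
print("Step labels:", step_labels)
```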

Key Capabilities & Features

  • Step-by-Step Verification: Generates natural language critiques and correctness judgments for each step in a solution prefix.
  • Data Efficiency: Achieves strong performance with significantly less supervision data (1K synthetic examples) compared to traditional discriminative PRMs.
  • Interpretability: Trained with a standard language modeling objective, so its verification reasoning is emitted as readable text that can be inspected and scored directly, making the process both transparent and scalable (see the scoring sketch after this list).
  • Superior Performance: Outperforms LLM-as-a-judge and discriminative PRM baselines, the latter trained on roughly 100x more process labels, on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
  • Long Context: The underlying architecture supports a context length of up to 131,072 tokens (served here with a 32k window).
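Because the verifier is an ordinary causal LM, a soft step score can be read off its own token probabilities rather than a separate classification head. The sketch below scores a single judgment slot this way; treating the label as a single token and the `\boxed{` prefix format are simplifying assumptions.

```python
# Sketch: reading a soft step score from the verifier's next-token
# distribution. Single-token labels are an assumption; some tokenizers split
# "correct" or prefer a leading-space variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "launch/ThinkPRM-14B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

def label_prob(prefix: str, label: str = "correct") -> float:
    """P(label | prefix) under the verifier, for a single-token label."""
    label_id = tokenizer.encode(label, add_special_tokens=False)[0]
    input_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # next-token logits
    return torch.softmax(logits, dim=-1)[label_id].item()

# Score the judgment slot right after an (assumed) "\boxed{" opener.
prefix = "Step 1: 12 * 7 = 84. This step is \\boxed{"
print(f"P(correct) for step 1 ~ {label_prob(prefix):.3f}")
```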

Ideal Use Cases

  • Solution Scoring: Assigning step-level or overall scores to candidate solutions for ranking in Best-of-N sampling or guiding tree search in reasoning tasks (a minimal reranking sketch follows this list).
  • Verification Rationale Generation: Producing detailed chain-of-thought verifications that explain why a particular step is correct or incorrect, enhancing interpretability.
  • Standalone Evaluation: Directly evaluating the correctness of a given problem-solution pair in domains like mathematical reasoning, scientific QA, and code generation.
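To make the Best-of-N use case concrete, here is a minimal reranking sketch. The `verify` callable stands in for the verification call shown earlier (same hypothetical label format), and the aggregation rule, scoring a solution by the fraction of steps judged correct, is one reasonable choice rather than a prescribed recipe.

```python
# Sketch: Best-of-N reranking with ThinkPRM-14B as the scorer. `verify` is a
# stand-in for a real verifier call; the fake stub below is for demonstration.
import re
from typing import Callable

def score_solution(verification_cot: str) -> float:
    """Aggregate per-step \\boxed{correct/incorrect} labels into one score."""
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    if not labels:
        return 0.0
    return sum(label == "correct" for label in labels) / len(labels)

def best_of_n(problem: str, candidates: list[str],
              verify: Callable[[str, str], str]) -> str:
    """Return the candidate whose verification CoT scores highest."""
    return max(candidates, key=lambda sol: score_solution(verify(problem, sol)))

# Usage with a stubbed verifier (replace with a real ThinkPRM call):
fake_verify = lambda p, s: (
    "Step 1 ... \\boxed{correct}\nStep 2 ... \\boxed{incorrect}"
)
print(best_of_n("2 + 2?", ["4", "5"], fake_verify))
```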