launch/ThinkPRM-14B
ThinkPRM-14B: Generative Process Reward Model
ThinkPRM-14B is a 14.8 billion parameter generative Process Reward Model (PRM) built upon the R1-Distill-Qwen-14B architecture. Its core function is to provide step-level verification scores and critiques for reasoning processes, such as mathematical solutions, by generating an explicit chain-of-thought (CoT) that labels each step as correct or incorrect.
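To make the input format concrete, here is a minimal sketch of assembling a problem and a candidate solution prefix into a verification prompt. The template below is illustrative only, not the model's official chat template; the `build_verification_prompt` helper is hypothetical.

```python
# Hypothetical helper: the exact prompt wording is an assumption, not the
# official ThinkPRM template -- consult the model card's chat template.
def build_verification_prompt(problem: str, steps: list[str]) -> str:
    """Format a problem and a candidate solution prefix for step-level
    verification by a generative PRM."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        "You are given a math problem and a proposed step-by-step solution.\n"
        "Review each step and judge whether it is correct.\n\n"
        f"Problem: {problem}\n\nSolution:\n{numbered}\n"
    )

prompt = build_verification_prompt(
    "What is 12 * 8?",
    ["12 * 8 = 12 * (10 - 2)", "= 120 - 24 = 96"],
)
```

The resulting string would then be passed to the model (e.g. via `transformers` text generation), which responds with a verification chain-of-thought labeling each step.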
Key Capabilities & Features
- Step-by-Step Verification: Generates natural language critiques and correctness judgments for each step in a solution prefix.
- Data Efficiency: Achieves strong performance with significantly less supervision data (1K synthetic examples) compared to traditional discriminative PRMs.
- Interpretability: Trained with a standard language modeling objective, so its judgments come with readable chain-of-thought rationales rather than opaque scalar scores, making the verification process transparent and easy to scale.
- Superior Performance: Outperforms LLM-as-a-judge and discriminative PRM baselines (trained on ~100x more labels) on benchmarks like ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
- High Context Length: Supports a context length of 131,072 tokens.
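Since the verifier emits its judgments inside free-form chain-of-thought, downstream code has to parse them out. A minimal sketch, assuming each step's verdict is marked with a `\boxed{correct}` / `\boxed{incorrect}` token (an assumed convention here; check the model's actual critique format before relying on it):

```python
import re

def extract_step_labels(verification_cot: str) -> list[bool]:
    """Return True for each step judged correct, in order of appearance.

    Assumes the (hypothetical) convention that the verifier marks every
    step with \\boxed{correct} or \\boxed{incorrect} in its critique.
    """
    labels = re.findall(r"\\boxed\{(correct|incorrect)\}", verification_cot)
    return [label == "correct" for label in labels]

cot = (
    "Step 1 rewrites 12*8 as 12*(10-2), which is valid. \\boxed{correct}\n"
    "Step 2 computes 120 - 24 = 96, which is right. \\boxed{correct}"
)
extract_step_labels(cot)  # -> [True, True]
```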
Ideal Use Cases
- Solution Scoring: Assigning step-level or overall scores to candidate solutions, for ranking in Best-of-N sampling or for guiding tree search in reasoning tasks.
- Verification Rationale Generation: Producing detailed chain-of-thought verifications that explain why a particular step is correct or incorrect, enhancing interpretability.
- Standalone Evaluation: Directly evaluating the correctness of a given problem-solution pair in domains like mathematical reasoning, scientific QA, and code generation.
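The Best-of-N use case can be sketched in a few lines. This assumes each candidate solution already has per-step correctness scores from the verifier; aggregating by the minimum over steps is one common choice, not the only one, and the names below are illustrative.

```python
def score_solution(step_probs: list[float]) -> float:
    """Aggregate step-level scores into one solution score (min over steps),
    so a single weak step penalizes the whole solution."""
    return min(step_probs) if step_probs else 0.0

def best_of_n(candidates: dict[str, list[float]]) -> str:
    """Pick the candidate whose weakest step is strongest."""
    return max(candidates, key=lambda name: score_solution(candidates[name]))

candidates = {
    "solution_a": [0.95, 0.40, 0.90],  # one shaky step drags it down
    "solution_b": [0.80, 0.85, 0.82],
}
best_of_n(candidates)  # -> "solution_b"
```

The same per-step scores can drive tree search instead of Best-of-N, by expanding only partial solutions whose prefix score stays above a threshold.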