dongboklee/gPRM-14B-merged
dongboklee/gPRM-14B-merged is a 14.8 billion parameter LoRA-merged language model developed by Dong Bok Lee et al. and prepared for vLLM inference. It is the LoRA-merged version of gPRM-14B and is designed as a reward model for multi-domain test-time scaling. Its primary application is reviewing step-by-step solutions to problems and judging the correctness of each step.
Overview
dongboklee/gPRM-14B-merged is a 14.8 billion parameter model, specifically a LoRA-merged version of gPRM-14B prepared for vLLM inference. Developed by Dong Bok Lee and collaborators, it functions as a reward model, as detailed in their paper "Rethinking Reward Models for Multi-Domain Test-Time Scaling" (arXiv:2510.00492). Its core capability is evaluating the correctness of individual steps within a proposed solution to a given problem.
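Because the LoRA weights are already merged into the checkpoint, it can be loaded with vLLM like any standalone model. The snippet below is a minimal sketch; the `dtype` and memory settings are illustrative assumptions, not documented requirements.

```python
from vllm import LLM

# Load the merged checkpoint directly; no LoRA adapter configuration is needed
# because the adapter weights are already folded into the base model.
llm = LLM(
    model="dongboklee/gPRM-14B-merged",
    dtype="bfloat16",              # assumed precision; adjust to your hardware
    gpu_memory_utilization=0.9,    # illustrative setting
)
```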
Key Capabilities
- Solution Step Verification: Critiques and verifies each step in a provided multi-step solution.
- Reward Calculation: Computes a reward score from the probability the model assigns to "Yes" versus "No" when asked whether a step is correct (see the sketch after this list).
- Multi-Domain Application: Designed for test-time scaling across various domains, suggesting adaptability to different problem types.
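A hedged sketch of how such a step reward could be computed with vLLM follows. The prompt template and the "Yes"/"No" comparison are assumptions made for illustration; the exact verification prompt used to train gPRM is not documented here.

```python
import math
from vllm import LLM, SamplingParams

# Hypothetical verification prompt; the actual template used for gPRM may differ.
PROMPT = (
    "Problem: {problem}\n"
    "Solution so far:\n{steps}\n"
    "Is the last step correct? Answer Yes or No.\nAnswer:"
)

# Engine as in the loading sketch above.
llm = LLM(model="dongboklee/gPRM-14B-merged", dtype="bfloat16")

# Generate a single token and request its top log-probabilities so the
# likelihoods of "Yes" and "No" can be compared.
params = SamplingParams(max_tokens=1, temperature=0.0, logprobs=20)

def step_reward(problem: str, steps: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for the last step as a soft reward."""
    prompt = PROMPT.format(problem=problem, steps=steps)
    completion = llm.generate([prompt], params)[0].outputs[0]
    token_logprobs = completion.logprobs[0]  # token_id -> Logprob for the generated token
    p_yes = p_no = 0.0
    for lp in token_logprobs.values():
        text = (lp.decoded_token or "").strip()
        if text == "Yes":
            p_yes += math.exp(lp.logprob)
        elif text == "No":
            p_no += math.exp(lp.logprob)
    total = p_yes + p_no
    return p_yes / total if total > 0 else 0.5

print(step_reward("Compute 2 + 3 * 4.", "Step 1: 3 * 4 = 12."))
```

For test-time scaling, such per-step rewards are typically aggregated (for example, by taking the minimum or the product across steps) into a trajectory-level score for ranking candidate solutions.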
Good For
- Automated Solution Assessment: Ideal for systems requiring automated evaluation of problem-solving steps.
- Reinforcement Learning from Human Feedback (RLHF) Pipelines: Can be integrated as a reward model component in RLHF setups.
- Educational Tools: Potentially useful in educational platforms to provide feedback on student-generated solutions.