gaotang/RM-R1-Qwen2.5-Instruct-14B
The gaotang/RM-R1-Qwen2.5-Instruct-14B is a 14.8-billion-parameter Reasoning Reward Model (ReasRM) built on the Qwen2.5-Instruct architecture and released under the gaotang namespace as part of the RM-R1 project. The model is trained in two stages: distillation of reasoning traces, followed by Reinforcement Learning with Verifiable Rewards (RLVR). Given a prompt and two candidate answers, it first generates structured rubrics or reasoning traces and then emits a preference, providing an interpretable justification for each judgment.
RM-R1-Qwen2.5-Instruct-14B: A Reasoning Reward Model
The gaotang/RM-R1-Qwen2.5-Instruct-14B implements the RM-R1 framework, which casts reward modeling as a reasoning task. Unlike traditional scalar reward models, it judges candidate answers by first generating explicit reasoning traces or evaluation rubrics and only then committing to a preference decision, so every evaluation comes with an interpretable justification.
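The sketch below shows one way to query the model for a pairwise judgment using the standard Hugging Face transformers API. The prompt wording and the [[A]]/[[B]] verdict markers are assumptions based on common pairwise-judging conventions; consult the model card or the RM-R1 repository for the exact template the model was trained with.

```python
# Minimal pairwise-judging sketch; prompt template is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-Qwen2.5-Instruct-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the capital of Australia?"
answer_a = "The capital of Australia is Sydney."
answer_b = "The capital of Australia is Canberra."

# The model is expected to write its rubrics/reasoning first,
# then emit a final verdict marker such as [[A]] or [[B]].
prompt = (
    "Please act as an impartial judge and evaluate which of the two "
    "responses below better answers the question. First explain your "
    "reasoning, then output your final verdict as [[A]] or [[B]].\n\n"
    f"[Question]\n{question}\n\n"
    f"[Response A]\n{answer_a}\n\n"
    f"[Response B]\n{answer_b}"
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```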
Key Capabilities
- Interpretable Reward Modeling: Generates structured rubrics or reasoning traces to explain its preference judgments.
- Two-Stage Training: Distillation of ~8.7K high-quality reasoning traces (Chain-of-Rubrics), followed by Reinforcement Learning with Verifiable Rewards (RLVR) on ~64K preference pairs; a sketch of such a verifiable reward follows this list.
- State-of-the-Art Performance: Achieves strong results on public reward-model benchmarks such as RewardBench and RM-Bench while keeping its judgments transparent and auditable.
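As a concrete illustration of the RLVR stage, the sketch below implements a verifiable reward of the kind the two-stage recipe relies on: the training signal simply checks whether the judge's emitted verdict matches the labeled winner of a preference pair. The [[A]]/[[B]] markers and the +1/-1 scheme are assumptions for illustration, not the exact RM-R1 implementation.

```python
import re

def parse_verdict(judgment: str) -> str | None:
    """Extract the final [[A]]/[[B]] verdict from a generated judgment."""
    matches = re.findall(r"\[\[([AB])\]\]", judgment)
    return matches[-1] if matches else None

def verifiable_reward(judgment: str, gold_preference: str) -> float:
    """+1 if the emitted preference matches the labeled winner, else -1.
    Malformed outputs (no parsable verdict) are also penalized."""
    verdict = parse_verdict(judgment)
    return 1.0 if verdict == gold_preference else -1.0
```

Because the reward depends only on the final verdict and a ground-truth label, it is verifiable without a learned critic, which is what makes the RL stage stable and cheap to supervise.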
Intended Uses
- RLHF / RLAIF: Serves as a plug-and-play pairwise reward function for policy optimization of large language models (see the wrapper sketch after this list).
- Automated Evaluation: Functions as an LLM-as-a-judge for tasks such as open-domain QA, chat, and complex reasoning.
- Research: Useful for studying process supervision, chain-of-thought verification, and rubric generation techniques.
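As a hypothetical example of the plug-and-play use above, the wrapper below turns a pairwise verdict into a (chosen, rejected) pair that preference-optimization methods such as DPO can consume. The `judge` callable is assumed to run an inference call like the earlier sketch and return the decoded judgment text; it is illustrative glue code, not part of the model's API.

```python
import re
from typing import Callable, Tuple

def prefer(
    question: str,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str, str, str], str],  # hypothetical inference wrapper
) -> Tuple[str, str]:
    """Return (chosen, rejected) per the judge's final [[A]]/[[B]] verdict."""
    judgment = judge(question, answer_a, answer_b)
    verdicts = re.findall(r"\[\[([AB])\]\]", judgment)
    # Fall back to the (A, B) order if no verdict can be parsed.
    if verdicts and verdicts[-1] == "B":
        return answer_b, answer_a
    return answer_a, answer_b
```

The same pattern supports best-of-2 sampling: generate two candidates from the policy and keep whichever one the judge marks as chosen.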