RM-R1-Qwen2.5-Instruct-7B: A Reasoning Reward Model
The gaotang/RM-R1-Qwen2.5-Instruct-7B is a 7.6-billion-parameter model built on the Qwen2.5-Instruct architecture, developed within the RM-R1 framework. It reframes reward modeling as a reasoning task: before issuing a preference judgment, the model first "thinks out loud," generating structured rubrics or reasoning traces. This approach achieves state-of-the-art performance on public reward model benchmarks while providing fully interpretable justifications for its evaluations.
Key Capabilities
- Two-stage Training: First distills approximately 8.7K high-quality Chain-of-Rubrics reasoning traces, then applies Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs.
- Interpretable Judgments: Generates detailed rubrics and reasoning traces, offering transparency into its preference decisions.
- Flexible Evaluation: Classifies each task as either 'Reasoning' (math, coding, domain knowledge, multi-step inference) or 'Chat' (open-ended conversation, stylistic rewrites, general helpfulness) and tailors its evaluation strategy to the task type.
Intended Uses
- RLHF / RLAIF: Serves as a plug-and-play reward function for policy optimization in reinforcement learning from human/AI feedback.
- Automated Evaluation: Functions as an LLM-as-a-judge for open-domain QA, chat, and complex reasoning tasks.
- Research: Provides a valuable tool for studying process supervision, chain-of-thought verification, and rubric generation in AI systems.
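For the RLHF / RLAIF use case, a pairwise judge can be wrapped into a scalar reward function. The sketch below assumes a `judge_fn` callable (e.g., a wrapper that prompts the model and parses its verdict); averaging over both response orderings is a standard position-debiasing trick, not something specific to RM-R1.

```python
def pairwise_reward(judge_fn, question: str, candidate: str, reference: str) -> float:
    """Turn a pairwise preference judge into a scalar reward.

    judge_fn(question, answer_a, answer_b) is any callable returning 'A'
    or 'B'. The candidate is scored against a fixed reference response,
    with both orderings evaluated to reduce position bias.
    """
    first = judge_fn(question, candidate, reference)   # candidate in slot A
    second = judge_fn(question, reference, candidate)  # candidate in slot B
    wins = (first == "A") + (second == "B")
    return wins / 2.0  # 1.0 = clear win, 0.5 = split, 0.0 = clear loss
```

This scalar can then be fed directly to a policy-optimization loop (e.g., PPO or GRPO) in place of a classifier-style reward model's score.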