RM-R1-DeepSeek-Distilled-Qwen-7B: Reasoning Reward Model
The gaotang/RM-R1-DeepSeek-Distilled-Qwen-7B is a 7 billion parameter Reasoning Reward Model (ReasRM) that redefines reward modeling by treating it as a reasoning task. Developed by Gaotang and the RM-R1-UIUC team, this model judges the quality of two candidate answers by first generating explicit reasoning traces or structured rubrics, then articulating its preference. This approach provides fully interpretable justifications, setting it apart from traditional scalar or generative reward models.
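As a sketch of how such a pairwise judge might be prompted, the snippet below builds a comparison prompt for two candidate answers. The instruction wording and the `[[A]]`/`[[B]]` verdict convention are illustrative assumptions, not the model's documented chat template.

```python
# Hypothetical prompt builder for a pairwise reasoning reward model.
# The instruction text and [[A]]/[[B]] verdict markers are assumptions
# chosen for illustration, not the model's documented format.
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    return (
        "You are an impartial judge. First write an evaluation rubric and "
        "your reasoning, then state which answer is better as [[A]] or [[B]].\n\n"
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}"
    )

prompt = build_judge_prompt("What is 2 + 2?", "4", "5")
```

In practice this prompt would be passed through the model's actual chat template (e.g. via `tokenizer.apply_chat_template` in Transformers) before generation.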
Key Capabilities & Training
- Interpretable Judgments: Unlike scalar reward models, which emit only a numeric score, RM-R1 provides clear, structured reasoning for its preferences, improving transparency and trustworthiness.
- State-of-the-Art Performance: Achieves leading performance on public reward modeling benchmarks by leveraging a two-stage training framework.
- Two-Stage Training: The model undergoes a distillation phase using approximately 8.7K high-quality reasoning traces (Chain-of-Rubrics), followed by Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs.
- DeepSeek Distillation: This variant is initialized from a DeepSeek-distilled checkpoint of the Qwen-2.5-Instruct backbone.
Ideal Use Cases
- RLHF / RLAIF: Serves as a robust, plug-and-play reward function for policy optimization in reinforcement learning from human/AI feedback.
- Automated Evaluation: Functions effectively as an LLM-as-a-judge for tasks like open-domain QA, chat, and complex reasoning, providing detailed evaluations.
- Research: Valuable for studying process supervision, chain-of-thought verification, and the generation of rubrics in AI systems.
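For the RLHF/RLAIF use case above, the pairwise verdict still has to be turned into scalar rewards for policy optimization. One simple mapping, shown here purely as an illustrative convention (the ±1 scheme is an assumption, not prescribed by the model):

```python
def pairwise_rewards(verdict: str) -> tuple[float, float]:
    """Map a judge verdict onto scalar rewards for (candidate_a, candidate_b).

    The +1/-1 scheme is one common convention for pairwise preference
    signals, chosen here for illustration.
    """
    if verdict == "A":
        return 1.0, -1.0
    if verdict == "B":
        return -1.0, 1.0
    raise ValueError(f"unrecognized verdict: {verdict!r}")

pairwise_rewards("A")  # → (1.0, -1.0)
```

These scalars can then feed a preference-based objective (e.g. as the reward in a PPO-style loop) in place of a learned scalar reward head.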