gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B
gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B is a 32-billion-parameter Reasoning Reward Model (ReasRM) released by gaotang. As its name indicates, this variant of the RM-R1 family is built on a DeepSeek-distilled Qwen backbone. Rather than emitting a bare score, the model judges the quality of AI chatbot responses by first generating structured rubrics or reasoning traces and then emitting a preference. It is intended as a plug-and-play reward function for RLHF/RLAIF, as an automated evaluator (LLM-as-a-judge), and for research into process supervision.
RM-R1: Reward Modeling as Reasoning
RM-R1 is a training framework for Reasoning Reward Models (ReasRMs) that formulates reward modeling as a reasoning task. Unlike traditional scalar or purely generative reward models, RM-R1 first "thinks out loud", producing a structured rubric or reasoning trace, before stating a preference between two candidate answers. The resulting judgments therefore come with fully interpretable justifications.
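As an illustration of this judge-style flow, here is a minimal inference sketch using Hugging Face transformers. The prompt wording and the [[A]]/[[B]] verdict convention are assumptions for illustration only; consult the official RM-R1 repository for the exact system prompt and chat template.

```python
# Minimal sketch: prompt the model to reason, then parse a pairwise verdict.
# Assumption: the model emits its final preference as [[A]] or [[B]].
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

question = "What causes tides on Earth?"
answer_a = "Tides are caused mainly by the Moon's gravitational pull."
answer_b = "Tides are caused by wind blowing across the ocean."

user_msg = (
    f"Question: {question}\n\n"
    f"Answer A: {answer_a}\n\n"
    f"Answer B: {answer_b}\n\n"
    "Evaluate both answers, explain your reasoning step by step, "
    "then output your final verdict as [[A]] or [[B]]."
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_msg}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=2048)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

match = re.search(r"\[\[([AB])\]\]", text)
print(text)  # full rubric / reasoning trace
print("Preferred:", match.group(1) if match else "unparsed")
```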
Key Capabilities & Training
- Interpretable Judgments: Provides explicit reasoning traces (Chain-of-Rubrics) for its preferences.
- State-of-the-Art Performance: Achieves leading performance on public reward model benchmarks.
- Two-Stage Training: an initial distillation phase on approximately 8.7K high-quality reasoning traces, followed by Reinforcement Learning with Verifiable Rewards (RLVR) on roughly 64K preference pairs (see the reward sketch after this list).
- Backbone Models: the RM-R1 family is built on Qwen-2.5-Instruct and DeepSeek-distilled checkpoints; this 32-billion-parameter variant uses the DeepSeek-distilled Qwen backbone.
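Because the final verdict is machine-checkable against the annotated preference label, the RLVR stage can use a simple rule-based reward. Below is a minimal sketch, assuming the [[A]]/[[B]] verdict convention from the example above; the paper's exact reward shaping may differ.

```python
import re

def verifiable_reward(completion: str, gold_label: str) -> float:
    """Rule-based reward for RLVR on preference pairs (illustrative only):
    +1 if the model's final verdict matches the gold preference ("A" or "B"),
    -1 otherwise. Unparseable outputs are penalized."""
    match = re.search(r"\[\[([AB])\]\]", completion)
    if match is None:
        return -1.0
    return 1.0 if match.group(1) == gold_label else -1.0
```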
Intended Uses
- RLHF / RLAIF: Serves as a direct replacement for existing reward functions in policy optimization.
- Automated Evaluation: Functions as an "LLM-as-a-judge" for evaluating open-domain QA, chat, and reasoning tasks.
- Research: Valuable for studying process supervision, chain-of-thought verification, and rubric generation methodologies.
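For the LLM-as-a-judge use case, pairwise verdicts are typically aggregated into a win rate between two systems. A hypothetical sketch, assuming a judge(question, answer_a, answer_b) helper that wraps the generation-and-parsing logic from the first example and returns "A" or "B":

```python
def win_rate(examples, judge):
    """Fraction of prompts on which system 1 beats system 2.

    `examples` is a list of (question, answer_1, answer_2) tuples;
    `judge` is a callable returning "A" or "B" (hypothetical helper).
    Swapping answer order on alternating examples mitigates the
    position bias common in pairwise LLM judges."""
    wins = 0
    for i, (q, a1, a2) in enumerate(examples):
        if i % 2 == 0:
            wins += judge(q, a1, a2) == "A"
        else:  # present system 1 second to control for position bias
            wins += judge(q, a2, a1) == "B"
    return wins / len(examples)
```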