gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B

Text generation · Concurrency cost: 2 · Model size: 32B · Quantization: FP8 · Context length: 32K · Published: May 6, 2025 · License: MIT · Architecture: Transformer · Open weights

gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B is a 32-billion-parameter Reasoning Reward Model (ReasRM) built on Qwen-2.5-Instruct and DeepSeek-distilled checkpoints, developed by gaotang. The model judges the quality of AI chatbot responses by first generating a structured rubric or reasoning trace and then emitting a preference between two candidate answers. It is intended as a plug-and-play reward function for RLHF/RLAIF, as an automated evaluator (LLM-as-a-judge), and for research into process supervision.
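A minimal inference sketch with Hugging Face transformers is shown below. The judge prompt wording and the [[A]]/[[B]] verdict markers are illustrative assumptions, not the documented RM-R1 prompt format; consult the RM-R1 repository for the exact Chain-of-Rubrics template.

```python
# Minimal sketch: load the model and ask it to judge two candidate answers.
# The prompt wording and the [[A]]/[[B]] verdict convention are assumptions,
# not the documented RM-R1 prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "Explain why the sky is blue."
answer_a = "Rayleigh scattering: shorter (blue) wavelengths scatter more strongly."
answer_b = "The sky reflects the color of the ocean."

prompt = (
    f"Question: {question}\n\n"
    f"Answer A: {answer_a}\n\n"
    f"Answer B: {answer_b}\n\n"
    "First write a rubric and evaluate both answers against it, "
    "then state your verdict as [[A]] or [[B]]."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
# Decode only the newly generated tokens (the rubric, reasoning, and verdict).
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```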


RM-R1: Reward Modeling as Reasoning

RM-R1 is a novel training framework for Reasoning Reward Models (ReasRM) that approaches reward modeling as a reasoning task. Unlike traditional scalar or generative reward models, RM-R1 first "thinks out loud" by generating structured rubrics or reasoning traces before determining a preference between two candidate answers. This approach enables fully interpretable justifications for its judgments.
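Because the verdict is embedded in free-form reasoning text, downstream pipelines typically extract it with a small parser. Here is a minimal sketch, assuming the verdict appears as a bracketed marker such as [[A]] or [[B]] (an illustrative convention, not a documented contract):

```python
import re

def parse_preference(judge_output: str) -> str | None:
    """Pull the final verdict out of a Chain-of-Rubrics trace.

    Assumes the judge ends with a marker like [[A]] or [[B]]; the last
    occurrence is taken so that rubric text merely mentioning a marker
    earlier in the trace does not win.
    """
    matches = re.findall(r"\[\[([AB])\]\]", judge_output)
    return matches[-1] if matches else None

# e.g. parse_preference("...rubric and analysis... Verdict: [[A]]") -> "A"
```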

Key Capabilities & Training

  • Interpretable Judgments: Provides explicit reasoning traces (Chain-of-Rubrics) for its preferences.
  • State-of-the-Art Performance: Achieves state-of-the-art or near state-of-the-art results among generative reward models on public benchmarks such as RewardBench, RM-Bench, and RMB.
  • Two-stage Training: An initial distillation phase on approximately 8.7K high-quality reasoning traces, followed by Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs; see the reward sketch after this list.
  • Backbone Models: This specific model is a 32 billion parameter variant, built upon Qwen-2.5-Instruct and DeepSeek-distilled checkpoints.
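One way to read the RLVR stage, sketched under the assumption that the verifiable reward is simply agreement with the annotated preference (the exact reward shaping used in training may differ):

```python
def rlvr_reward(judge_verdict: str | None, human_label: str) -> float:
    # Verifiable reward: 1.0 when the model's parsed verdict ("A"/"B")
    # matches the annotated preference label, else 0.0. A schematic
    # reading of RLVR on preference pairs, not the authors' exact code.
    return 1.0 if judge_verdict == human_label else 0.0
```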

Intended Uses

  • RLHF / RLAIF: Serves as a drop-in replacement for existing reward functions in policy optimization (see the sketch after this list).
  • Automated Evaluation: Functions as an "LLM-as-a-judge" for evaluating open-domain QA, chat, and reasoning tasks.
  • Research: Valuable for studying process supervision, chain-of-thought verification, and rubric generation methodologies.
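A hypothetical sketch of the plug-and-play reward use in an RLHF/RLAIF loop, reusing parse_preference from above; generate_judgment is a placeholder for running the judge model on a (question, response A, response B) triple, as in the inference sketch earlier:

```python
def pairwise_reward(question: str, resp_a: str, resp_b: str) -> tuple[float, float]:
    """Turn a pairwise verdict into +1/-1 rewards for two policy samples.

    generate_judgment is a hypothetical helper standing in for the
    judge-model inference sketched above; parse_preference extracts the
    [[A]]/[[B]] verdict from its output.
    """
    verdict = parse_preference(generate_judgment(question, resp_a, resp_b))
    if verdict == "A":
        return 1.0, -1.0
    if verdict == "B":
        return -1.0, 1.0
    return 0.0, 0.0  # unparseable verdict: treat as a tie or skip the pair
```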