gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B
gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B is a 32-billion-parameter Reasoning Reward Model (ReasRM) released by gaotang. As its name indicates, this variant of the RM-R1 family is built on a DeepSeek-distilled Qwen backbone. Rather than emitting a bare score, the model judges the quality of AI chatbot responses by first generating structured rubrics or reasoning traces and then emitting a preference. It is intended as a plug-and-play reward function for RLHF/RLAIF, as an automated evaluator (LLM-as-a-judge), and for research into process supervision.
RM-R1: Reward Modeling as Reasoning
RM-R1 is a training framework for Reasoning Reward Models (ReasRMs) that formulates reward modeling as a reasoning task. Unlike traditional scalar or purely generative reward models, RM-R1 first "thinks out loud", producing a structured rubric or reasoning trace, before stating a preference between two candidate answers. The resulting judgments therefore come with fully interpretable justifications.
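As an illustration of this judge-style flow, here is a minimal inference sketch using Hugging Face transformers. The prompt wording and the [[A]]/[[B]] verdict convention are assumptions for illustration only; consult the official RM-R1 repository for the exact system prompt and chat template.

```python
# Minimal sketch: prompt the model to reason, then parse a pairwise verdict.
# Assumption: the model emits its final preference as [[A]] or [[B]].
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

question = "What causes tides on Earth?"
answer_a = "Tides are caused mainly by the Moon's gravitational pull."
answer_b = "Tides are caused by wind blowing across the ocean."

user_msg = (
    f"Question: {question}\n\n"
    f"Answer A: {answer_a}\n\n"
    f"Answer B: {answer_b}\n\n"
    "Evaluate both answers, explain your reasoning step by step, "
    "then output your final verdict as [[A]] or [[B]]."
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_msg}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=2048)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

match = re.search(r"\[\[([AB])\]\]", text)
print(text)  # full rubric / reasoning trace
print("Preferred:", match.group(1) if match else "unparsed")
```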
Key Capabilities & Training
- Interpretable Judgments: Provides explicit reasoning traces (Chain-of-Rubrics) for its preferences.
- State-of-the-Art Performance: Achieves leading performance on public reward model benchmarks.
- Two-Stage Training: an initial distillation phase on approximately 8.7K high-quality reasoning traces, followed by Reinforcement Learning with Verifiable Rewards (RLVR) on roughly 64K preference pairs (see the reward sketch after this list).
- Backbone Models: the RM-R1 family is built on Qwen-2.5-Instruct and DeepSeek-distilled checkpoints; this 32-billion-parameter variant uses the DeepSeek-distilled Qwen backbone.
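Because the final verdict is machine-checkable against the annotated preference label, the RLVR stage can use a simple rule-based reward. Below is a minimal sketch, assuming the [[A]]/[[B]] verdict convention from the example above; the paper's exact reward shaping may differ.

```python
import re

def verifiable_reward(completion: str, gold_label: str) -> float:
    """Rule-based reward for RLVR on preference pairs (illustrative only):
    +1 if the model's final verdict matches the gold preference ("A" or "B"),
    -1 otherwise. Unparseable outputs are penalized."""
    match = re.search(r"\[\[([AB])\]\]", completion)
    if match is None:
        return -1.0
    return 1.0 if match.group(1) == gold_label else -1.0
```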
Intended Uses
- RLHF / RLAIF: Serves as a direct replacement for existing reward functions in policy optimization.
- Automated Evaluation: Functions as an "LLM-as-a-judge" for evaluating open-domain QA, chat, and reasoning tasks.
- Research: Valuable for studying process supervision, chain-of-thought verification, and rubric generation methodologies.
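For the LLM-as-a-judge use case, pairwise verdicts are typically aggregated into a win rate between two systems. A hypothetical sketch, assuming a judge(question, answer_a, answer_b) helper that wraps the generation-and-parsing logic from the first example and returns "A" or "B":

```python
def win_rate(examples, judge):
    """Fraction of prompts on which system 1 beats system 2.

    `examples` is a list of (question, answer_1, answer_2) tuples;
    `judge` is a callable returning "A" or "B" (hypothetical helper).
    Swapping answer order on alternating examples mitigates the
    position bias common in pairwise LLM judges."""
    wins = 0
    for i, (q, a1, a2) in enumerate(examples):
        if i % 2 == 0:
            wins += judge(q, a1, a2) == "A"
        else:  # present system 1 second to control for position bias
            wins += judge(q, a2, a1) == "B"
    return wins / len(examples)
```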