RM-R1-Qwen2.5-Instruct-7B: A Reasoning Reward Model
The gaotang/RM-R1-Qwen2.5-Instruct-7B is a 7.6-billion-parameter model built on the Qwen2.5-Instruct architecture, developed within the RM-R1 framework. It reframes reward modeling as a reasoning task: before issuing a preference judgment, the model first "thinks out loud," generating structured rubrics or reasoning traces. This approach achieves state-of-the-art performance on public reward model benchmarks while providing fully interpretable justifications for its evaluations.
Key Capabilities
- Two-stage Training: First distills approximately 8.7K high-quality Chain-of-Rubrics reasoning traces, then applies Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs.
- Interpretable Judgments: Generates detailed rubrics and reasoning traces, offering transparency into its preference decisions.
- Flexible Evaluation: Classifies each task as either 'Reasoning' (math, coding, domain knowledge, multi-step inference) or 'Chat' (open-ended conversation, stylistic rewrites, general helpfulness) and tailors its evaluation strategy to the task type.
Intended Uses
- RLHF / RLAIF: Serves as a plug-and-play reward function for policy optimization in reinforcement learning from human/AI feedback.
- Automated Evaluation: Functions as an LLM-as-a-judge for open-domain QA, chat, and complex reasoning tasks.
- Research: Provides a valuable tool for studying process supervision, chain-of-thought verification, and rubric generation in AI systems.
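For the RLHF / RLAIF use case, a pairwise judge can be wrapped into a scalar reward function. The sketch below assumes a `judge_fn` callable (e.g., a wrapper that prompts the model and parses its verdict); averaging over both response orderings is a standard position-debiasing trick, not something specific to RM-R1.

```python
def pairwise_reward(judge_fn, question: str, candidate: str, reference: str) -> float:
    """Turn a pairwise preference judge into a scalar reward.

    judge_fn(question, answer_a, answer_b) is any callable returning 'A'
    or 'B'. The candidate is scored against a fixed reference response,
    with both orderings evaluated to reduce position bias.
    """
    first = judge_fn(question, candidate, reference)   # candidate in slot A
    second = judge_fn(question, reference, candidate)  # candidate in slot B
    wins = (first == "A") + (second == "B")
    return wins / 2.0  # 1.0 = clear win, 0.5 = split, 0.0 = clear loss
```

This scalar can then be fed directly to a policy-optimization loop (e.g., PPO or GRPO) in place of a classifier-style reward model's score.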