gaotang/RM-R1-Qwen2.5-Instruct-14B
The gaotang/RM-R1-Qwen2.5-Instruct-14B is a 14.8-billion-parameter Reasoning Reward Model (ReasRM) built on the Qwen2.5-Instruct architecture and released under the gaotang namespace as part of the RM-R1 project. The model is trained in two stages: distillation of reasoning traces, followed by Reinforcement Learning with Verifiable Rewards (RLVR). Given a prompt and two candidate answers, it first generates structured rubrics or reasoning traces and then emits a preference, providing an interpretable justification for each judgment.
RM-R1-Qwen2.5-Instruct-14B: A Reasoning Reward Model
The gaotang/RM-R1-Qwen2.5-Instruct-14B implements the RM-R1 framework, which casts reward modeling as a reasoning task. Unlike traditional scalar reward models, it judges candidate answers by first generating explicit reasoning traces or evaluation rubrics and only then committing to a preference decision, so every evaluation comes with an interpretable justification.
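The sketch below shows one way to query the model for a pairwise judgment using the standard Hugging Face transformers API. The prompt wording and the [[A]]/[[B]] verdict markers are assumptions based on common pairwise-judging conventions; consult the model card or the RM-R1 repository for the exact template the model was trained with.

```python
# Minimal pairwise-judging sketch; prompt template is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaotang/RM-R1-Qwen2.5-Instruct-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the capital of Australia?"
answer_a = "The capital of Australia is Sydney."
answer_b = "The capital of Australia is Canberra."

# The model is expected to write its rubrics/reasoning first,
# then emit a final verdict marker such as [[A]] or [[B]].
prompt = (
    "Please act as an impartial judge and evaluate which of the two "
    "responses below better answers the question. First explain your "
    "reasoning, then output your final verdict as [[A]] or [[B]].\n\n"
    f"[Question]\n{question}\n\n"
    f"[Response A]\n{answer_a}\n\n"
    f"[Response B]\n{answer_b}"
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```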
Key Capabilities
- Interpretable Reward Modeling: Generates structured rubrics or reasoning traces to explain its preference judgments.
- Two-Stage Training: Distillation of ~8.7K high-quality reasoning traces (Chain-of-Rubrics), followed by Reinforcement Learning with Verifiable Rewards (RLVR) on ~64K preference pairs; a sketch of such a verifiable reward follows this list.
- State-of-the-Art Performance: Achieves strong results on public reward-model benchmarks such as RewardBench and RM-Bench while keeping its judgments transparent and auditable.
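As a concrete illustration of the RLVR stage, the sketch below implements a verifiable reward of the kind the two-stage recipe relies on: the training signal simply checks whether the judge's emitted verdict matches the labeled winner of a preference pair. The [[A]]/[[B]] markers and the +1/-1 scheme are assumptions for illustration, not the exact RM-R1 implementation.

```python
import re

def parse_verdict(judgment: str) -> str | None:
    """Extract the final [[A]]/[[B]] verdict from a generated judgment."""
    matches = re.findall(r"\[\[([AB])\]\]", judgment)
    return matches[-1] if matches else None

def verifiable_reward(judgment: str, gold_preference: str) -> float:
    """+1 if the emitted preference matches the labeled winner, else -1.
    Malformed outputs (no parsable verdict) are also penalized."""
    verdict = parse_verdict(judgment)
    return 1.0 if verdict == gold_preference else -1.0
```

Because the reward depends only on the final verdict and a ground-truth label, it is verifiable without a learned critic, which is what makes the RL stage stable and cheap to supervise.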
Intended Uses
- RLHF / RLAIF: Serves as a plug-and-play pairwise reward function for policy optimization of large language models (see the wrapper sketch after this list).
- Automated Evaluation: Functions as an LLM-as-a-judge for tasks such as open-domain QA, chat, and complex reasoning.
- Research: Useful for studying process supervision, chain-of-thought verification, and rubric generation techniques.
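As a hypothetical example of the plug-and-play use above, the wrapper below turns a pairwise verdict into a (chosen, rejected) pair that preference-optimization methods such as DPO can consume. The `judge` callable is assumed to run an inference call like the earlier sketch and return the decoded judgment text; it is illustrative glue code, not part of the model's API.

```python
import re
from typing import Callable, Tuple

def prefer(
    question: str,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str, str, str], str],  # hypothetical inference wrapper
) -> Tuple[str, str]:
    """Return (chosen, rejected) per the judge's final [[A]]/[[B]] verdict."""
    judgment = judge(question, answer_a, answer_b)
    verdicts = re.findall(r"\[\[([AB])\]\]", judgment)
    # Fall back to the (A, B) order if no verdict can be parsed.
    if verdicts and verdicts[-1] == "B":
        return answer_b, answer_a
    return answer_a, answer_b
```

The same pattern supports best-of-2 sampling: generate two candidates from the policy and keep whichever one the judge marks as chosen.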