Name: gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: gaotang

RM-R1-DeepSeek-Distilled-Qwen-14B: Reasoning Reward Model

This model is a 14 billion parameter variant of the RM-R1 framework, which re-conceptualizes reward modeling as a reasoning task. Developed by gaotang, it utilizes a DeepSeek-distilled Qwen-2.5-Instruct backbone. Unlike traditional scalar or generative reward models, RM-R1 first "thinks out loud" by generating structured rubrics or reasoning traces before expressing a preference, providing fully interpretable justifications.

Key Capabilities

Interpretable Reward Modeling: Generates explicit reasoning traces or rubrics to justify its preference between two candidate answers.
Two-Stage Training: Employs a two-stage training process involving distillation of approximately 8.7K high-quality reasoning traces (Chain-of-Rubrics) followed by Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs.
State-of-the-Art Performance: Achieves competitive performance on public reward model benchmarks while offering transparency in its decision-making.

Good For

RLHF / RLAIF: Serves as a direct, plug-and-play reward function for optimizing language model policies.
Automated Evaluation: Ideal for use as an LLM-as-a-judge in tasks like open-domain QA, chat, and general reasoning, providing detailed feedback.
Research: Valuable for studying process supervision, chain-of-thought verification, and rubric generation techniques in AI.

Overview

RM-R1-DeepSeek-Distilled-Qwen-14B: Reasoning Reward Model

Key Capabilities

Good For

Full Model Card (README)