gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B

Text Generation · Concurrency Cost: 1 · Model Size: 14B · Quant: FP8 · Ctx Length: 32k · Published: May 6, 2025 · License: MIT · Architecture: Transformer · Open Weights

gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B is a 14-billion-parameter Reasoning Reward Model (ReasRM) developed by gaotang, built on a DeepSeek-distilled Qwen-2.5-Instruct backbone. It judges candidate answers by first generating structured rubrics or reasoning traces and then emitting a preference, so each judgment comes with an interpretable justification. It is primarily intended as a plug-and-play reward function for policy optimization in RLHF/RLAIF and as an LLM-as-a-judge for automated evaluation.

RM-R1-DeepSeek-Distilled-Qwen-14B: Reasoning Reward Model

This model is the 14-billion-parameter variant of the RM-R1 framework, which recasts reward modeling as a reasoning task. Developed by gaotang, it uses a DeepSeek-distilled Qwen-2.5-Instruct backbone. Unlike traditional scalar or generative reward models, RM-R1 first "thinks out loud" by generating structured rubrics or reasoning traces and only then states a preference, so the justification for each judgment can be read directly.

Key Capabilities

  • Interpretable Reward Modeling: Generates explicit reasoning traces or rubrics to justify its preference between two candidate answers.
  • Two-Stage Training: Employs a two-stage training process involving distillation of approximately 8.7K high-quality reasoning traces (Chain-of-Rubrics) followed by Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs.
  • Benchmark Performance: Described as competitive with the state of the art on public reward model benchmarks, while keeping its decision-making transparent.
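
The capabilities above can be made concrete with a small sketch. It splits a judge's output into its reasoning trace and final preference, then scores that preference against a labeled one, which is roughly what a verifiable reward over preference pairs could look like. The `[[A]]`/`[[B]]` verdict marker, the helper names, and the +1/-1 reward values are assumptions made for illustration; this page does not state the model's actual output format or the reward used in training.

```python
import re
from typing import NamedTuple, Optional

# Assumed verdict format: the judge ends with a marker such as [[A]] or [[B]].
# This is an illustrative convention, not a format confirmed by this page.
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


class Judgment(NamedTuple):
    reasoning: str             # everything the judge wrote before the verdict
    preference: Optional[str]  # "A", "B", or None if no verdict was found


def parse_judgment(text: str) -> Judgment:
    """Split a judge's output into its reasoning trace and final preference."""
    matches = list(VERDICT_RE.finditer(text))
    if not matches:
        return Judgment(reasoning=text.strip(), preference=None)
    last = matches[-1]  # use the last marker, in case the trace mentions others
    return Judgment(reasoning=text[: last.start()].strip(), preference=last.group(1))


def verifiable_reward(text: str, gold: str) -> float:
    """Return +1.0 if the parsed preference matches the labeled one, else -1.0.

    The +1/-1 scale is an assumption; an unparseable output also scores -1.0.
    """
    return 1.0 if parse_judgment(text).preference == gold else -1.0


sample = "Rubric: accuracy, clarity.\nAnswer A cites the correct formula.\n[[A]]"
judgment = parse_judgment(sample)
print(judgment.reasoning)
print(judgment.preference)
print(verifiable_reward(sample, gold="B"))
```

Keeping the reasoning and the verdict separate lets a caller log or inspect the justification while only the verdict feeds the reward.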

Good For

  • RLHF / RLAIF: Serves as a direct, plug-and-play reward function for optimizing language model policies.
  • Automated Evaluation: Ideal for use as an LLM-as-a-judge in tasks like open-domain QA, chat, and general reasoning, providing detailed feedback.
  • Research: Valuable for studying process supervision, chain-of-thought verification, and rubric generation techniques in AI.
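
As a rough illustration of the first two uses, the sketch below wraps a judge behind a plain callable and derives both a pairwise reward and a win rate from its verdicts. Everything here is hypothetical: the `Judge` signature, the `[[A]]`/`[[B]]` verdict marker, the +1/-1/0 reward values, and `toy_judge`, which stands in for a real call to the model so that the example runs without loading any weights.

```python
import re
from typing import Callable, Iterable, Tuple

# Hypothetical interface: a judge takes (question, answer_a, answer_b) and
# returns the model's full text output, ending in an assumed [[A]]/[[B]] marker.
Judge = Callable[[str, str, str], str]


def preferred(judge: Judge, question: str, a: str, b: str) -> str:
    """Return "A" or "B" from the judge's last verdict marker, or "" if none."""
    found = re.findall(r"\[\[([AB])\]\]", judge(question, a, b))
    return found[-1] if found else ""


def pairwise_reward(judge: Judge, question: str, a: str, b: str) -> Tuple[float, float]:
    """Map one judgment to rewards for (a, b): winner +1, loser -1, no verdict 0."""
    choice = preferred(judge, question, a, b)
    if choice == "A":
        return 1.0, -1.0
    if choice == "B":
        return -1.0, 1.0
    return 0.0, 0.0


def win_rate(judge: Judge, triples: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (question, candidate, baseline) triples where the candidate wins."""
    results = [preferred(judge, q, cand, base) == "A" for q, cand, base in triples]
    return sum(results) / len(results) if results else 0.0


def toy_judge(question: str, a: str, b: str) -> str:
    """Stand-in for the real model: prefers the longer answer."""
    return f"Rubric: completeness.\n[[{'A' if len(a) >= len(b) else 'B'}]]"


print(pairwise_reward(toy_judge, "What is 2+2?", "4, since 2+2=4.", "5"))
print(win_rate(toy_judge, [("q1", "long answer", "no"), ("q2", "no", "long answer")]))
```

In practice the callable would send a judging prompt to the model and return its generated text. Pairwise judges can be sensitive to answer order, so one option is to query both orderings and keep only verdicts that agree.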