gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B
gaotang/RM-R1-DeepSeek-Distilled-Qwen-14B is a 14-billion-parameter Reasoning Reward Model (ReasRM) developed by gaotang, built on a DeepSeek-distilled Qwen-2.5-Instruct backbone. It judges candidate answers by first generating structured rubrics or reasoning traces and then emitting a preference, so each judgment comes with an interpretable justification. It is primarily intended as a plug-and-play reward function for policy optimization in RLHF/RLAIF and as an LLM-as-a-judge for automated evaluation.
RM-R1-DeepSeek-Distilled-Qwen-14B: Reasoning Reward Model
This model is the 14-billion-parameter variant of the RM-R1 framework, which recasts reward modeling as a reasoning task. Developed by gaotang, it uses a DeepSeek-distilled Qwen-2.5-Instruct backbone. Unlike traditional scalar or generative reward models, RM-R1 first "thinks out loud" by generating structured rubrics or reasoning traces, and only then expresses a preference, so the justification for each judgment can be read and inspected.
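Because the preference comes at the end of a longer reasoning trace, downstream code typically has to extract the verdict from the generated text. The following is a minimal sketch of that step. The `[[A]]` / `[[B]]` verdict marker and the `<rubric>`, `<eval>`, and `<answer>` tags in the sample are assumptions made for illustration and should be checked against the model's actual prompt template; `parse_preference` is a hypothetical helper, not part of any released API.

```python
import re
from typing import Optional

# Assumed verdict format: the final choice appears as [[A]] or [[B]].
_VERDICT = re.compile(r"\[\[\s*([AB])\s*\]\]")


def parse_preference(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last verdict marker, or None if none is found."""
    matches = _VERDICT.findall(judge_output)
    # Take the last marker so that any markers quoted inside the reasoning
    # do not override the final verdict.
    return matches[-1] if matches else None


# Hypothetical judge output, written by hand for this example.
sample = (
    "<rubric>Accuracy matters most for this factual question.</rubric>\n"
    "<eval>Response A states the correct date; Response B does not.</eval>\n"
    "<answer>[[A]]</answer>"
)

print(parse_preference(sample))
```

Returning `None` for unparseable output, rather than guessing, lets the caller decide whether to retry generation or discard the sample.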
Key Capabilities
- Interpretable Reward Modeling: Generates explicit reasoning traces or rubrics to justify its preference between two candidate answers.
- Two-Stage Training: Employs a two-stage training process involving distillation of approximately 8.7K high-quality reasoning traces (Chain-of-Rubrics) followed by Reinforcement Learning with Verifiable Rewards (RLVR) on about 64K preference pairs.
- Benchmark Performance: Reported to be competitive with state-of-the-art models on public reward model benchmarks, while keeping its decision-making transparent.
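The RLVR stage relies on a reward that can be checked automatically: for a preference pair, the gold label of the better answer is known, so the model's verdict is either right or wrong. A minimal sketch of such a correctness reward follows. The +1 / -1 values and the function names are assumptions chosen for illustration; the exact reward used in RM-R1 training is not stated here.

```python
from typing import Optional, Sequence


def verifiable_reward(predicted: Optional[str], gold: str) -> float:
    """Return +1.0 if the predicted label matches the gold label, else -1.0.

    An unparseable verdict (predicted is None) is treated as incorrect.
    The +1 / -1 scale is an assumption made for this sketch.
    """
    return 1.0 if predicted == gold else -1.0


def batch_accuracy(predictions: Sequence[Optional[str]], golds: Sequence[str]) -> float:
    """Fraction of preference pairs where the predicted label equals the gold label."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)


# Toy labels: one correct verdict, one wrong verdict, one unparseable output.
rewards = [verifiable_reward(p, g) for p, g in zip(["A", "B", None], ["A", "A", "B"])]
print(rewards)
```

The same comparison, averaged over a held-out set as in `batch_accuracy`, is the usual way to score a pairwise judge on a preference benchmark.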
Good For
- RLHF / RLAIF: Serves as a direct, plug-and-play reward function for optimizing language model policies.
- Automated Evaluation: Ideal for use as an LLM-as-a-judge in tasks like open-domain QA, chat, and general reasoning, providing detailed feedback.
- Research: Valuable for studying process supervision, chain-of-thought verification, and rubric generation techniques in AI.
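For the RLHF / RLAIF and judging use cases above, a pairwise verdict usually has to be turned into per-candidate scores. One possible approach, sketched below, queries the judge in both presentation orders and averages the outcomes, which reduces the effect of any position bias. This is an illustrative design, not the authors' documented pipeline: the `Judge` callable is a hypothetical stand-in for a function that prompts the model and parses its verdict, and `longer_is_better` is a toy judge used only so the example runs without the model.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (prompt, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_rewards(judge: Judge, prompt: str, first: str, second: str) -> Tuple[float, float]:
    """Score two candidates in [0, 1] by judging them in both presentation orders."""
    wins_first = 0
    # Order 1: `first` is shown in position A.
    if judge(prompt, first, second) == "A":
        wins_first += 1
    # Order 2: `first` is shown in position B.
    if judge(prompt, second, first) == "B":
        wins_first += 1
    reward_first = wins_first / 2
    return reward_first, 1.0 - reward_first


def longer_is_better(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy judge for demonstration only: prefers the longer answer."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


print(pairwise_rewards(longer_is_better, "What is 2 + 2?", "It equals four.", "4"))
```

A judge that always picks the same position ends up assigning 0.5 to both candidates under this scheme, so order-dependent verdicts contribute no learning signal instead of a misleading one.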