gaotang/RM-R1-DeepSeek-Distilled-Qwen-7B
Text Generation · Model Size: 7.6B · Quantization: FP8 · Context Length: 32k · Published: May 6, 2025 · License: MIT · Architecture: Transformer · Open Weights

gaotang/RM-R1-DeepSeek-Distilled-Qwen-7B is a 7B-parameter Reasoning Reward Model (ReasRM) from the RM-R1 project at UIUC, built on a DeepSeek-R1-distilled Qwen base model, as the name indicates. Rather than emitting a bare scalar score, it judges candidate answers by first generating a structured rubric or reasoning trace and then stating its preference, so every verdict comes with an interpretable justification. This makes it well suited as a plug-and-play reward function for RLHF/RLAIF, as an automated evaluator (LLM-as-a-judge), and for research into process supervision. By casting reward modeling as a reasoning task, RM-R1 reports strong results on public reward modeling benchmarks.
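To make the LLM-as-a-judge usage concrete, here is a minimal sketch of pairwise judging with this checkpoint via Hugging Face `transformers`. The prompt template and the `[[A]]`/`[[B]]` verdict markers are assumptions modeled on common pairwise-judge conventions, not the model's documented format; check the model card for the exact template before relying on this.

```python
MODEL_ID = "gaotang/RM-R1-DeepSeek-Distilled-Qwen-7B"

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise judge prompt. NOTE: this template is an
    illustrative assumption, not the official RM-R1 prompt format."""
    return (
        "You are an impartial judge. First write a short rubric and your "
        "reasoning, then output your final verdict: [[A]] if Assistant A's "
        "answer is better, or [[B]] if Assistant B's answer is better.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant A's Answer]\n{answer_a}\n\n"
        f"[Assistant B's Answer]\n{answer_b}\n"
    )

def parse_verdict(generation: str):
    """Extract the preference marker from the judge's generated reasoning."""
    if "[[A]]" in generation:
        return "A"
    if "[[B]]" in generation:
        return "B"
    return None  # no clear verdict found

def judge(question: str, answer_a: str, answer_b: str):
    """Run the reward model as a generative judge (downloads ~7.6B weights)."""
    # Imported lazily so the prompt helpers above stay dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = [{"role": "user", "content": build_judge_prompt(question, answer_a, answer_b)}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Leave room for the reasoning trace before the final verdict marker.
    output = model.generate(input_ids, max_new_tokens=2048)
    text = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return parse_verdict(text), text
```

In an RLAIF loop, the returned preference can be converted into a pairwise reward signal; because the full reasoning trace is also returned, disagreements between the judge and human annotators can be audited directly.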
