gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B
Text generation · Model size: 32B · Quant: FP8 · Context length: 32k · Concurrency cost: 2 · Published: May 6, 2025 · License: MIT · Architecture: Transformer · Open weights

gaotang/RM-R1-DeepSeek-Distilled-Qwen-32B is a 32-billion-parameter Reasoning Reward Model (ReasRM), developed by gaotang, built on Qwen-2.5-Instruct and DeepSeek-distilled checkpoints. Rather than scoring responses directly, the model judges the quality of AI chatbot responses by first generating structured rubrics or reasoning traces and then emitting a preference. It is primarily intended as a plug-and-play reward function in RLHF/RLAIF pipelines, for automated evaluation (LLM-as-a-judge), and for research into process supervision.
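The judge-then-prefer workflow above can be sketched with two small helpers: one that builds a pairwise comparison prompt, and one that parses the final verdict out of the model's reasoning trace. This is a minimal sketch under assumptions: the prompt wording and the `[[A]]`/`[[B]]` verdict markers are hypothetical illustrations of the pattern, not the model's documented chat template, which should be taken from the model card itself.

```python
import re


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # Hypothetical pairwise-judging template. In practice, apply the
    # tokenizer's chat template from the model repository instead.
    return (
        "Please act as an impartial judge and evaluate the two responses "
        "below. First write out your evaluation rubric and reasoning, then "
        'output your final verdict as "[[A]]" if response A is better, or '
        '"[[B]]" if response B is better.\n\n'
        f"[Question]\n{question}\n\n"
        f"[Response A]\n{answer_a}\n\n"
        f"[Response B]\n{answer_b}\n"
    )


def parse_preference(generation: str):
    # The model emits its reasoning first, so take the *last* verdict
    # marker in the generated text; return None if no verdict was emitted.
    matches = re.findall(r"\[\[(A|B)\]\]", generation)
    return matches[-1] if matches else None
```

A downstream RLHF loop would generate from the model with `build_judge_prompt(...)` and feed `parse_preference(...)` into its reward computation, e.g. reward 1.0 for the preferred response and 0.0 for the other.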
