Reward-Reasoning/RRM-32B

TEXT GENERATION · Model size: 32.8B · Quantization: FP8 · Context length: 32k · Concurrency cost: 2 · Published: May 20, 2025 · Architecture: Transformer

Reward-Reasoning/RRM-32B is a 32.8 billion parameter Reward Reasoning Model (RRM) built on the Qwen2 architecture that frames reward modeling as a reasoning task: it generates an explicit chain-of-thought before assigning rewards, allowing it to adaptively allocate compute to harder evaluations. Trained with the Reward Reasoning via Reinforcement Learning framework, which requires no supervised reasoning traces, RRM-32B consistently outperforms strong baseline reward models across diverse domains, including reasoning, general knowledge, and human preference alignment, making it well suited to reward-guided inference and post-training of LLMs.

Reward Reasoning Model (RRM-32B)

RRM-32B is a 32.8 billion parameter model that redefines reward modeling by integrating a reasoning process. Unlike traditional reward models, RRMs generate a detailed chain-of-thought analysis before assigning rewards, enabling them to adaptively allocate computational resources based on the complexity of the evaluation scenario. This approach allows for more nuanced and accurate preference judgments.
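
To make this concrete, below is a minimal sketch of querying the model for a pairwise preference judgment with Hugging Face transformers. The prompt wording and the \boxed{A}/\boxed{B} verdict format are illustrative assumptions, not the model's documented template; consult the model's chat template for the exact format it was trained with.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Reward-Reasoning/RRM-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Illustrative pairwise-judging prompt; the exact template the model was
# trained with is defined by its chat template, so treat this as a sketch.
question = "What is the capital of Australia?"
answer_a = "Sydney is the capital of Australia."
answer_b = "Canberra is the capital of Australia."

messages = [{
    "role": "user",
    "content": (
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}\n\n"
        "Compare the two responses, reason step by step, then state your "
        "final verdict as \\boxed{A} or \\boxed{B}."
    ),
}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note the generous max_new_tokens budget: the model emits its chain-of-thought before the verdict, so truncating generation too early can cut off the judgment.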

Key Capabilities & Features

  • Chain-of-Thought Reward Modeling: Frames reward assignment as a reasoning task, producing explicit analysis before generating rewards.
  • Reinforcement Learning Training: Utilizes a novel Reward Reasoning via Reinforcement Learning framework, allowing the model to self-evolve sophisticated reasoning capabilities without requiring supervised reasoning traces.
  • Enhanced Accuracy: Demonstrates superior performance compared to strong baseline reward models across various domains, including reasoning, general knowledge, and alignment with human preferences.
  • Adaptive Compute Utilization: Can dynamically scale test-time compute through parallel and sequential reasoning steps to optimize performance.
  • Multi-Response Rewarding: Supports strategies such as ELO rating and knockout tournaments for flexible reward allocation across many candidates (a knockout sketch follows this list).
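
As an illustration of the knockout-tournament strategy, here is a hedged sketch of reducing N candidates to a single winner via pairwise elimination. judge_pair is a hypothetical helper (not part of any published API) that wraps one pairwise call like the sketch above and returns the index of the preferred response.

```python
import random

def knockout_tournament(question, candidates, judge_pair):
    """Reduce N candidate responses to a single winner via pairwise knockout.

    `judge_pair(question, a, b)` is a hypothetical helper that queries the
    reward model once and returns 0 if `a` wins, 1 if `b` wins.
    """
    pool = list(candidates)
    random.shuffle(pool)  # randomize the bracket to reduce position bias
    while len(pool) > 1:
        next_round = []
        # Pair off candidates; an odd one out gets a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winner = a if judge_pair(question, a, b) == 0 else b
            next_round.append(winner)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```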

Performance Highlights

Evaluations show RRM-32B (voting@16) achieving 91.9% on RewardBench and 80.2% on the PandaLM Test for agreement with human preferences. On the PPE benchmark for binary preference classification, RRM-32B (voting@5) scores 81.7% overall, with 95.4% on MATH and 81.3% on MMLU-Pro.
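
The voting@k figures above presumably aggregate k independently sampled judgments by majority vote. A minimal sketch of that aggregation, with sample_verdict as a hypothetical wrapper around one stochastic (temperature > 0) judging pass:

```python
from collections import Counter

def vote_at_k(question, answer_a, answer_b, sample_verdict, k=16):
    """Majority-vote aggregation over k sampled judgments (voting@k).

    `sample_verdict(question, a, b)` is a hypothetical helper that runs one
    stochastic judging pass and returns "A" or "B". Ties (possible when k
    is even) fall back to the first verdict encountered.
    """
    votes = Counter(
        sample_verdict(question, answer_a, answer_b) for _ in range(k)
    )
    return votes.most_common(1)[0][0]
```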

Good For

  • Reward-guided Best-of-N Inference: Effectively guides the selection of the best response from multiple candidates (see the sketch after this list).
  • Post-training LLMs: Provides high-quality preference signals for fine-tuning large language models using methods like DPO or RL.
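
A hedged sketch of reward-guided Best-of-N, reusing the knockout_tournament helper from the capabilities section; policy_generate is a hypothetical sampler for the model being guided:

```python
def best_of_n(question, policy_generate, judge_pair, n=8):
    """Reward-guided Best-of-N: sample n candidates from a policy model,
    then let the reward model pick the winner via knockout elimination.

    `policy_generate(question)` is a hypothetical helper that returns one
    sampled response from the model being guided.
    """
    candidates = [policy_generate(question) for _ in range(n)]
    return knockout_tournament(question, candidates, judge_pair)
```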