nvidia/Qwen3-Nemotron-235B-A22B-GenRM

TEXT GENERATION · Concurrency Cost: 4 · Model Size: 235B · Quant: FP8 · Ctx Length: 32k · Published: Dec 3, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

Qwen3-Nemotron-235B-A22B-GenRM is a 235 billion parameter Generative Reward Model (GenRM) developed by NVIDIA, built on the Qwen3-235B-A22B-Thinking-2507 foundation. The model is fine-tuned to evaluate the quality of AI assistant responses, producing individual helpfulness scores and a ranking score between candidate responses. It is designed for Reinforcement Learning from Human Feedback (RLHF) training, for example of the NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model, and supports a maximum context length of 128k tokens.


Model Overview

NVIDIA's Qwen3-Nemotron-235B-A22B-GenRM is a 235 billion parameter Generative Reward Model (GenRM) built on the Qwen3 architecture, specifically using the Qwen3-235B-A22B-Thinking-2507 foundation. Its primary function is to evaluate the quality of AI assistant responses by providing helpfulness scores for individual responses and a ranking score between two candidates, given a conversation history and a new user request. This model is integral to the Reinforcement Learning from Human Feedback (RLHF) training process, notably for models like NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.
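To make the input the overview describes concrete, here is a minimal sketch of packing a conversation history and two candidate responses into a single judging prompt. The message schema is the generic OpenAI-style format, and the judging prompt wording is an illustrative assumption; it is not the documented template from the model card.

```python
# A minimal sketch of the evaluation input: a conversation history plus
# candidate responses to the latest user request. The judging prompt wording
# below is an illustrative assumption, not the documented template.

conversation = [
    {"role": "user", "content": "How do I reverse a list in Python?"},
    {"role": "assistant", "content": "You can use slicing: my_list[::-1]."},
    {"role": "user", "content": "Is there an in-place way?"},
]

candidate_a = "Yes, call my_list.reverse(); it reverses the list in place."
candidate_b = "No, Python lists cannot be reversed in place."

# Flatten the history into plain text for the judging turn.
history = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)

judge_messages = [{
    "role": "user",
    "content": (
        "Rate each candidate's helpfulness (1-5) and give a ranking score "
        "(1-6) comparing them.\n\n"
        f"Conversation:\n{history}\n\n"
        f"Candidate A: {candidate_a}\n"
        f"Candidate B: {candidate_b}"
    ),
}]
print(judge_messages[0]["content"])
```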

Key Capabilities

  • Response Evaluation: Assesses the quality of AI assistant responses, outputting individual helpfulness scores (1-5) and comparative ranking scores (1-6); see the serving sketch after this list.
  • RLHF Integration: Designed to facilitate the fine-tuning of other language models through RLHF.
  • High Performance: Optimized for NVIDIA GPU-accelerated systems, using software stacks such as CUDA for faster training and inference.
  • Extensive Context: Supports an input context of up to 128k tokens.
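For the serving side, the sketch below queries the model through an OpenAI-compatible endpoint, such as a locally hosted vLLM server, and parses a helpfulness score from the generated judgment. The base_url, the plain-text "Helpfulness: N" output convention, and the regex are assumptions for illustration; the model card defines the actual output format.

```python
# A minimal sketch of scoring a response via an OpenAI-compatible endpoint
# (e.g., a local vLLM server). The base_url and the "Helpfulness: N" output
# convention are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="nvidia/Qwen3-Nemotron-235B-A22B-GenRM",
    messages=[{
        "role": "user",
        "content": (
            "Rate the helpfulness (1-5) of this response.\n\n"
            "User: How do I reverse a list in Python?\n"
            "Response: Use my_list[::-1] or the reversed() built-in."
        ),
    }],
    temperature=0.0,
)

judgment = completion.choices[0].message.content
# Extract a 1-5 score, assuming the judgment states it as "Helpfulness: N".
match = re.search(r"[Hh]elpfulness:\s*([1-5])", judgment)
helpfulness = int(match.group(1)) if match else None
print(judgment)
print("parsed score:", helpfulness)
```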

Performance Benchmarks

The model demonstrates strong performance across various evaluation suites:

  • RM-Bench: Achieves an 87.3 Overall score, with high scores in Math (96.9) and Safety (94.4).
  • JudgeBench: Scores an 87.4 Overall, with notable results in Reasoning (95.9) and Code (95.2).

Use Cases

This model is ideal for developers and researchers focused on:

  • RLHF Reward Modeling: Directly applicable in RLHF pipelines to improve the alignment and quality of generative AI models, as sketched below.
  • Automated Response Quality Assessment: Can be integrated into systems requiring automated evaluation of chatbot or assistant outputs.
  • Research in AI Alignment: Provides a robust tool for studying and implementing preference-based learning.
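As a sketch of the RLHF use case, the snippet below turns the GenRM's pairwise ranking score into (chosen, rejected) preference pairs of the kind consumed by preference-based training pipelines. The scale semantics assumed here (1-3 favors response A, 4-6 favors response B) are an illustrative assumption; consult the model card for the actual convention.

```python
# A minimal sketch of converting a 1-6 pairwise ranking score into a
# preference pair for RLHF-style training data. The scale semantics
# (1-3 favors A, 4-6 favors B) are an illustrative assumption.
from typing import NamedTuple

class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str

def to_preference_pair(prompt: str, resp_a: str, resp_b: str,
                       ranking_score: int) -> PreferencePair:
    """Map a 1-6 ranking score onto a (chosen, rejected) pair."""
    if not 1 <= ranking_score <= 6:
        raise ValueError("ranking score must be in 1-6")
    # Assumed convention: the lower half of the scale prefers response A.
    if ranking_score <= 3:
        return PreferencePair(prompt, chosen=resp_a, rejected=resp_b)
    return PreferencePair(prompt, chosen=resp_b, rejected=resp_a)

pair = to_preference_pair(
    "How do I reverse a list in Python?",
    "Use my_list[::-1] or the reversed() built-in.",
    "Lists cannot be reversed.",
    ranking_score=2,  # score as returned by the GenRM (assumed semantics)
)
print(pair.chosen)
```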