nvidia/Qwen3-Nemotron-32B-GenRM-Principle
TEXT GENERATION · Concurrency Cost: 2 · Model Size: 32B · Quant: FP8 · Context Length: 32k · Published: Oct 12, 2025 · License: nvidia-open-model-license · Architecture: Transformer

The nvidia/Qwen3-Nemotron-32B-GenRM-Principle is a 32 billion parameter large language model, built upon the Qwen3 foundation, specifically fine-tuned as a Generative Reward Model. It predicts the extent to which LLM-generated responses fulfill user-specified principles by assigning a reward score. This model achieves top performance on both the JudgeBench (81.4%) and RM-Bench (86.2%) benchmarks, making it highly effective for evaluating LLM response quality against defined criteria.


Model Overview

The nvidia/Qwen3-Nemotron-32B-GenRM-Principle is a 32 billion parameter Generative Reward Model (GenRM) developed by NVIDIA, leveraging the Qwen3-32B architecture. Its core function is to evaluate the quality of LLM-generated responses based on user-specified principles, assigning a numerical reward score. A higher score indicates a greater fulfillment of the given principle.

Key Capabilities

  • Principle-based Evaluation: Rates LLM responses against explicit user-defined principles (e.g., "correctness", "helpfulness").
  • Reward Scoring: Outputs a single float value representing the degree of principle fulfillment.
  • Benchmark Performance: Achieves 81.4% on JudgeBench and 86.2% on RM-Bench (as of Sep 24, 2025), positioning it as a leading GenRM for these benchmarks.
  • Foundation: Built on the robust Qwen3-32B model, ensuring strong underlying language understanding.
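To make the principle-based evaluation flow concrete, the sketch below assembles a scoring request for a generative reward model. The message layout, instruction wording, and `build_genrm_messages` helper are assumptions for illustration, not the model's official prompt template; consult the model card for the exact format it expects.

```python
# Hypothetical sketch: packaging a user-specified principle, the original
# prompt, and a candidate response into a chat-style request for a GenRM.
# The instruction wording here is an assumption, not the official template.

def build_genrm_messages(principle: str, user_prompt: str, response: str) -> list[dict]:
    """Build a chat message list asking the GenRM to score a response
    against a single named principle."""
    instruction = (
        f"Principle: {principle}\n\n"
        f"User prompt:\n{user_prompt}\n\n"
        f"Candidate response:\n{response}\n\n"
        "Rate how well the candidate response fulfills the principle. "
        "Reply with a single numerical reward score."
    )
    return [{"role": "user", "content": instruction}]

messages = build_genrm_messages(
    principle="correctness",
    user_prompt="What is 2 + 2?",
    response="2 + 2 equals 4.",
)
```

The resulting `messages` list can then be passed to whatever chat-completion interface serves the model; the reward is parsed from the single float the model emits.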

Use Cases

This model is ideal for applications requiring automated evaluation of LLM outputs, particularly for:

  • LLM Alignment: Fine-tuning and aligning other LLMs to adhere to specific behavioral or quality principles.
  • Response Ranking: Comparing and ranking multiple LLM responses based on their adherence to desired criteria.
  • Quality Assurance: Automatically assessing the quality and principle compliance of generated text in various domains like chat, math, code, and safety.
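The response-ranking use case can be sketched as a best-of-N loop: score each candidate against the principle, then sort by reward. In this self-contained example, `score_response` is a stub standing in for a real call to the GenRM (a real implementation would send the prompt to the model and parse the float it returns); the length-based scoring is purely illustrative.

```python
# Hypothetical sketch of best-of-N response ranking with a reward model.

def score_response(principle: str, prompt: str, response: str) -> float:
    """Stub scorer: in practice this would query the GenRM and parse the
    float reward it emits. Here we fake a score by response length so the
    example runs on its own."""
    return float(len(response))

def rank_responses(principle: str, prompt: str, responses: list[str]) -> list[str]:
    """Return candidate responses sorted from highest to lowest reward."""
    scored = [(score_response(principle, prompt, r), r) for r in responses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]

ranked = rank_responses(
    "helpfulness",
    "Explain recursion.",
    ["Recursion is when a function calls itself.", "See Wikipedia."],
)
```

With a real scorer plugged in, the same loop supports rejection sampling during alignment: generate N candidates, keep the top-ranked one.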