nvidia/Llama-3.3-Nemotron-70B-Reward-Principle
Text Generation · Model Size: 70B · Quantization: FP8 · Context Length: 32k · Published: Oct 12, 2025 · License: NVIDIA Open Model License · Architecture: Transformer · Open Weights

The nvidia/Llama-3.3-Nemotron-70B-Reward-Principle is a 70 billion parameter reward model developed by NVIDIA, built upon the Meta-Llama-3.3-70B-Instruct foundation. This model is specifically fine-tuned to predict the extent to which LLM-generated responses adhere to user-specified principles, assigning a reward score to the final assistant turn in a conversation up to 4,096 tokens. It excels in evaluating response quality against principles, achieving 76.3% on JudgeBench and 83.6% on RM-Bench, positioning it as a top-performing scalar reward model for assessing LLM outputs.


Model Overview

The nvidia/Llama-3.3-Nemotron-70B-Reward-Principle is a 70-billion-parameter reward model developed by NVIDIA, built on Meta-Llama-3.3-70B-Instruct. Its core function is to evaluate the quality of an LLM-generated response by predicting how well it fulfills a user-specified principle, assigning a scalar reward score. The model processes conversations of up to 4,096 tokens and returns a quantitative measure in which a higher score indicates greater adherence to the principle.
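A minimal sketch of how a request to the model might be assembled: the conversation ends with the assistant turn to be scored, and the principle is supplied alongside it. The payload layout, the placement of the principle in a system message, and the model identifier string are assumptions for illustration; consult NVIDIA's API documentation for the actual interface.

```python
# Hypothetical request payload for scoring the final assistant turn
# against a principle. Field names and the model id are illustrative.
import json


def build_reward_request(principle: str, user_msg: str, assistant_msg: str) -> dict:
    """Assemble a chat-style payload whose final assistant turn is to be
    scored for adherence to the given principle (the whole conversation
    must fit within the 4,096-token limit)."""
    return {
        "model": "nvidia/llama-3.3-nemotron-70b-reward-principle",
        "messages": [
            {"role": "system", "content": f"principle: {principle}"},
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ],
    }


payload = build_reward_request(
    principle="correctness",
    user_msg="What is 2 + 2?",
    assistant_msg="2 + 2 = 4.",
)
print(json.dumps(payload, indent=2))
```

The key structural point is that the model scores only the final assistant message, so each candidate response requires its own request.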

Key Capabilities

  • Principle-based Response Evaluation: Rates LLM responses based on their alignment with a given principle (e.g., correctness, safety).
  • High Performance on Benchmarks: Achieves an 83.6% overall score on RM-Bench and 76.3% on JudgeBench, demonstrating strong capabilities in evaluating chat, math, code, and safety aspects of responses.
  • Scalar Reward Output: Provides a single float value representing the degree of principle fulfillment, useful for reinforcement learning from human feedback (RLHF) or automated quality control.
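Because the output is a single scalar, best-of-N selection is straightforward: generate several candidate responses, score each, and keep the highest-scoring one. A minimal sketch, with a toy stand-in `score` function in place of a real reward-model call:

```python
# Best-of-N selection driven by a scalar reward.
# `score` is a stand-in; in practice it would call the reward model
# and return its scalar score for the final assistant turn.
def score(response: str) -> float:
    # Toy heuristic standing in for the model's reward score.
    return float(len(response.split()))


def best_of_n(candidates: list[str]) -> str:
    """Return the candidate with the highest reward score."""
    return max(candidates, key=score)


candidates = ["4", "The answer is 4.", "Unsure."]
print(best_of_n(candidates))  # → "The answer is 4."
```

The same pattern scales to RLHF-style reranking: the reward model acts as the selection criterion over sampled generations.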

Use Cases

  • LLM-as-a-Judge Applications: Ideal for scenarios requiring automated assessment of LLM outputs against specific criteria.
  • Reinforcement Learning: Can be integrated into RLHF pipelines to guide LLM training towards more principle-aligned responses.
  • Content Moderation: Useful for evaluating responses for safety, bias, or other ethical principles.
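For moderation or data-curation pipelines, the scalar score can be compared against a threshold to accept or flag responses. A sketch under illustrative assumptions (the threshold value and the example scores are made up; a real threshold must be calibrated on labeled data):

```python
# Filtering a batch of (response, reward) pairs by a calibrated threshold.
# Scores and threshold here are illustrative, not model outputs.
def filter_adherent(scored: list[tuple[str, float]], threshold: float) -> list[str]:
    """Keep responses whose reward score meets or exceeds the threshold."""
    return [resp for resp, s in scored if s >= threshold]


scored = [("Safe answer.", 2.3), ("Borderline.", 0.1), ("Unsafe.", -3.0)]
print(filter_adherent(scored, threshold=0.0))  # → ['Safe answer.', 'Borderline.']
```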

This model is designed for NVIDIA GPU-accelerated systems (e.g., Ampere and Hopper architectures) and requires at least two 80 GB GPUs for deployment.