nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual

Available on Hugging Face

  • Task: Text Generation
  • Model Size: 70B
  • Quantization: FP8
  • Context Length: 32k
  • Concurrency Cost: 4
  • Published: May 28, 2025
  • License: nvidia-open-model-license
  • Architecture: Transformer
  • Weights: Open

nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual is a 70 billion parameter reward model developed by NVIDIA, built upon the Meta-Llama-3.3-70B-Instruct foundation. It is fine-tuned using scaled Bradley-Terry modeling to predict the quality of LLM-generated responses in a multilingual context. The model assigns reward scores to assistant turns in conversations as an indicator of response quality, achieving the top score on RM-Bench (82.4%) and the second-highest on JudgeBench (69.4%) among Bradley-Terry Reward Models.


Model Overview

nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual is a 70 billion parameter reward model developed by NVIDIA, leveraging the Meta-Llama-3.3-70B-Instruct foundation. It is specifically fine-tuned using scaled Bradley-Terry modeling to assess the quality of LLM-generated responses in multilingual conversations. The model processes multi-turn conversations up to 4,096 tokens and outputs a single float value representing the quality of the final assistant turn.
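The contract is simple: the model takes a multi-turn conversation in and returns one float for the final assistant turn. A minimal sketch of scoring a conversation over HTTP, assuming the model is served behind an OpenAI-compatible chat-completions endpoint that returns the scalar as text like `reward:-1.25` in the message content — the endpoint path, request shape, and `reward:<float>` response format are assumptions about the serving layer, not the documented API; consult your deployment's model card for the real schema:

```python
import json
import re
import urllib.request


def parse_reward(content: str) -> float:
    """Extract the scalar reward from a response string such as 'reward:-1.25'.

    The 'reward:<float>' format is an assumption about the serving layer.
    """
    match = re.search(r"reward:\s*(-?\d+(?:\.\d+)?)", content)
    if match is None:
        raise ValueError(f"no reward found in: {content!r}")
    return float(match.group(1))


def score_conversation(base_url: str, messages: list[dict]) -> float:
    """Send a conversation (last turn = assistant) and return its reward score."""
    payload = json.dumps({
        "model": "nvidia/llama-3.3-nemotron-70b-reward-multilingual",
        "messages": messages,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The reward is assumed to arrive as text inside the first choice's message.
    return parse_reward(body["choices"][0]["message"]["content"])
```

Only `parse_reward` is independent of the serving stack; everything in `score_conversation` depends on how the endpoint is deployed.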

Key Capabilities & Performance

  • Response Quality Scoring: Assigns a reward score to LLM-generated responses, where a higher score indicates higher quality. This score is relative to other responses for the same prompt.
  • Multilingual Support: Designed to evaluate responses across various languages.
  • Benchmark Leader: As of May 15, 2025, it achieves the highest score on RM-Bench at 82.4% and the second-highest on JudgeBench at 69.4% among Bradley-Terry Reward Models.
  • Foundation: Built on the Llama 3.3 Transformer architecture.
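The scaled Bradley-Terry objective behind these scores can be illustrated in plain Python: a pairwise loss that pushes the chosen response's reward above the rejected one's, with a scaling factor standing in for preference strength. This is an illustrative formulation of the technique, not NVIDIA's training code, and the exact scaling used for this model is an assumption:

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float, scale: float = 1.0) -> float:
    """Pairwise Bradley-Terry loss: -log(sigmoid(scale * (r_chosen - r_rejected))).

    `scale` stands in for the preference-strength scaling in "scaled"
    Bradley-Terry training; its exact form here is an assumption.
    """
    diff = scale * (r_chosen - r_rejected)
    # -log(sigmoid(diff)), computed stably for both signs of diff:
    # for diff >= 0 this is log(1 + e^-diff); for diff < 0, -diff + log(1 + e^diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

When the two rewards tie, the loss is log 2; it shrinks toward zero as the chosen response's reward pulls ahead, which is what drives the model to separate good and bad responses to the same prompt.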

Use Cases

This model is ideal for:

  • LLM Evaluation: Programmatically assessing the quality of responses from other large language models.
  • Reinforcement Learning from Human Feedback (RLHF): Providing a reward signal for training or fine-tuning generative LLMs.
  • Response Ranking: Comparing and ranking different LLM outputs for a given prompt based on their predicted quality.
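For the ranking use case, the scalar scores order candidates directly, remembering that scores are only comparable between responses to the same prompt. A sketch with a stand-in scoring function — `reward` here is a hypothetical callable that a real pipeline would back with calls to the reward model:

```python
from typing import Callable


def rank_responses(
    prompt: str,
    candidates: list[str],
    reward: Callable[[str, str], float],
) -> list[tuple[str, float]]:
    """Rank candidate responses to one prompt by reward score, best first.

    `reward(prompt, response) -> float` is a stand-in for querying the model;
    scores are relative, so only compare responses to the same prompt.
    """
    scored = [(resp, reward(prompt, resp)) for resp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy scorer for illustration only; it rewards longer answers.
ranked = rank_responses("Q", ["ok", "a fuller answer"], lambda p, r: float(len(r)))
```

The same shape works for best-of-n sampling in RLHF pipelines: generate n candidates, score them all, keep the top-ranked one.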

Technical Details

The model was trained using the HelpSteer3-Preference dataset, which includes human-annotated preferences. It is optimized for NVIDIA GPU-accelerated systems and supports NVIDIA Ampere, Hopper, and Turing microarchitectures.