Model Overview
nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual is a 70-billion-parameter reward model developed by NVIDIA, built on the Meta-Llama-3.3-70B-Instruct foundation. It is fine-tuned with scaled Bradley-Terry modeling to assess the quality of LLM-generated responses in multilingual conversations. The model accepts multi-turn conversations of up to 4,096 tokens and outputs a single float representing the quality of the final assistant turn.
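Bradley-Terry training teaches the model to assign a higher scalar reward to the human-preferred response in a pair. The exact scaling NVIDIA applies is not specified here; the sketch below illustrates only the standard pairwise form of the loss, -log σ(r_chosen − r_rejected), in pure Python:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Minimized when the reward model scores the preferred (chosen)
    response well above the rejected one for the same prompt.
    """
    margin = r_chosen - r_rejected
    # Numerically stable form: -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# The loss shrinks as the chosen response's score pulls ahead:
print(bradley_terry_loss(2.0, 0.0))  # small loss: correct ordering
print(bradley_terry_loss(0.0, 2.0))  # large loss: wrong ordering
```

Because only the difference of scores enters the loss, the rewards the model learns are meaningful relative to one another, not on an absolute scale.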
Key Capabilities & Performance
- Response Quality Scoring: Assigns a reward score to an LLM-generated response, where a higher score indicates higher quality. Scores are meaningful only relative to other responses to the same prompt, not as an absolute measure.
- Multilingual Support: Designed to evaluate responses across various languages.
- Benchmark Leader: As of May 15, 2025, it achieves the highest score on RM-Bench at 82.4% and the second-highest on JudgeBench at 69.4% among Bradley-Terry Reward Models.
- Foundation: Built on the Llama 3.3 Transformer architecture.
Use Cases
This model is ideal for:
- LLM Evaluation: Programmatically assessing the quality of responses from other large language models.
- Reinforcement Learning from Human Feedback (RLHF): Providing a reward signal for training or fine-tuning generative LLMs.
- Response Ranking: Comparing and ranking different LLM outputs for a given prompt based on their predicted quality.
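Because scores are comparable only among responses to the same prompt, the ranking use case reduces to scoring each candidate and sorting. A minimal sketch, where `score_response` is a hypothetical stand-in for an actual call to the reward model:

```python
from typing import Callable

def rank_responses(prompt: str,
                   candidates: list[str],
                   score_response: Callable[[str, str], float]
                   ) -> list[tuple[str, float]]:
    """Rank candidate responses to a single prompt, best first.

    `score_response(prompt, response)` is a placeholder for a real
    reward-model call; only the relative ordering of its outputs matters.
    """
    scored = [(resp, score_response(prompt, resp)) for resp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in scorer (prefers longer answers), for illustration only:
ranked = rank_responses("What is 2+2?",
                        ["4", "2+2 equals 4.", "idk"],
                        lambda p, r: float(len(r)))
print([resp for resp, _ in ranked])  # → ['2+2 equals 4.', 'idk', '4']
```

In practice the lambda would be replaced by a function that runs the conversation through the reward model and reads off the float it emits for the final assistant turn.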
Technical Details
The model was trained on the HelpSteer3-Preference dataset, which contains human-annotated preference pairs. It is optimized for NVIDIA GPU-accelerated systems and supports the NVIDIA Ampere, Hopper, and Turing microarchitectures.