Model Overview
NVIDIA's Llama-3.1-Nemotron-Safety-Guard-8B-v3 is an 8-billion-parameter multilingual content safety model designed to moderate human-LLM interactions. Built on Meta's Llama-3.1-8B-Instruct, the model has been LoRA-tuned by NVIDIA to align with its content safety risk taxonomy.
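As a minimal loading sketch, assuming the checkpoint is published on Hugging Face under the ID `nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3` and loads through the standard `transformers` auto classes:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID inferred from the model name above; an assumption, not confirmed
# by this card.
MODEL_ID = "nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # 8B weights fit on a single modern GPU in bf16
    device_map="auto",
)
```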
Key Capabilities
- Content Moderation: Classifies user prompts and LLM responses as 'safe' or 'unsafe'.
- Category Identification: For unsafe content, it provides a list of violated safety categories.
- Multilingual Support: Officially supports nine languages (English, Spanish, Mandarin, German, French, Hindi, Japanese, Arabic, and Thai) and generalizes zero-shot to more than 20 languages.
- Customizable Taxonomy: Can be prompted with a custom instruction and taxonomy of unsafe risks, allowing detection of novel categories beyond its training data (see the prompt sketch after this list).
- Commercial Use: Ready for commercial deployment.
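The exact prompt template ships with the model itself; the sketch below only illustrates the general shape of a custom-taxonomy moderation prompt. The category names, delimiter tags, and wording are assumptions, not the official template.

```python
# Illustrative only: a hypothetical custom taxonomy. The model can be prompted
# with categories that were not part of its training data.
UNSAFE_CATEGORIES = """\
S1: Violence.
S2: Self-Harm.
S3: Fraud/Deception.
"""

def build_moderation_prompt(user_message: str, bot_response: str | None = None) -> str:
    """Assemble a content-safety prompt over a custom risk taxonomy."""
    conversation = f"user: {user_message}"
    if bot_response is not None:
        conversation += f"\nresponse: {bot_response}"
    return (
        "Task: Check if there is unsafe content in the conversation below "
        "according to our safety policy with the following categories.\n\n"
        f"<BEGIN UNSAFE CONTENT CATEGORIES>\n{UNSAFE_CATEGORIES}"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        f"<BEGIN CONVERSATION>\n{conversation}\n<END CONVERSATION>\n\n"
        "Provide your safety assessment in JSON."
    )
```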
Training and Performance
The model was trained on the synthetically curated Nemotron-Safety-Guard-Dataset-v3, developed via the CultureGuard pipeline. It achieves strong scores across safety evaluation datasets, including 85.32 on Nemotron-Safety-Guard-Dataset-v3 and 96.79 on Aya Red-teaming.
Use Cases
This model is intended for developers and researchers building LLM applications who require robust, multilingual content safety mechanisms for responsible AI deployment. It outputs safety assessments in a structured JSON format, indicating the safety status of both the user prompt and the LLM response along with any violated categories.
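A small sketch of consuming that output, assuming the JSON fields are named "User Safety", "Response Safety", and "Safety Categories" (field names inferred from the description above, not confirmed by this card):

```python
import json

def parse_safety_output(raw: str) -> dict:
    """Parse the model's JSON safety assessment.

    Field names ("User Safety", "Response Safety", "Safety Categories")
    are assumptions based on the structured output described above.
    """
    result = json.loads(raw.strip())
    return {
        "user_safe": result.get("User Safety") == "safe",
        "response_safe": result.get("Response Safety", "safe") == "safe",
        "categories": result.get("Safety Categories", ""),
    }

# Example with a hypothetical unsafe verdict:
raw = '{"User Safety": "unsafe", "Response Safety": "safe", "Safety Categories": "Violence"}'
print(parse_safety_output(raw))
# {'user_safe': False, 'response_safe': True, 'categories': 'Violence'}
```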