Overview
Llama Guard 3 is an 8 billion parameter model from Meta, built on the Llama 3.1 architecture and fine-tuned specifically for content safety classification. It is itself an LLM: it generates text output indicating whether a given prompt or response is safe or unsafe and, if unsafe, listing the violated content categories. The model is aligned with the standardized MLCommons hazards taxonomy, covering 14 distinct categories, including specialized detection for Code Interpreter Abuse.
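The sketch below shows what this typically looks like with the Hugging Face transformers library. The meta-llama/Llama-Guard-3-8B checkpoint name and the behavior of its chat template (rendering a conversation into the safety-classification prompt) are assumptions based on the public release, not details stated in this overview.

```python
# Minimal sketch: classify a user prompt with Llama Guard 3 via transformers.
# Assumes the meta-llama/Llama-Guard-3-8B checkpoint and its bundled chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The conversation to moderate: a user prompt, optionally followed by the
# assistant response you want checked as well.
chat = [
    {"role": "user", "content": "How do I pick a lock?"},
]

# The chat template renders the conversation into Llama Guard's classification prompt.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens: "safe", or "unsafe" plus category codes.
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)
```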
Key Capabilities
- Comprehensive Content Moderation: Classifies content in both LLM inputs (prompts) and outputs (responses) against 14 hazard categories, such as Violent Crimes, Child Sexual Exploitation, Hate, and Elections; a sketch for mapping the returned category codes to names follows this list.
- Multilingual Support: Provides content safety classification in 8 languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.
- Tool Call Safety: Optimized to support safety and security for search and code interpreter tool calls, detecting potential abuse.
- Improved Performance: Demonstrates higher F1 scores and significantly lower false positive rates compared to Llama Guard 2 and GPT-4 across English, multilingual, and tool use evaluations.
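Because the verdict is plain text rather than structured output, callers typically parse it themselves. The helper below is a minimal sketch: the exact output format ("safe", or "unsafe" followed by comma-separated S-codes on the next line) and the code-to-name mapping of the MLCommons taxonomy are assumptions drawn from the public Llama Guard 3 model card rather than from this overview.

```python
# Sketch of a parser for Llama Guard 3's raw text verdict.
# Assumed format: "safe", or "unsafe\nS1,S10,..." with MLCommons S-codes.
MLCOMMONS_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def parse_verdict(raw: str) -> tuple[bool, list[str]]:
    """Return (is_safe, violated_category_names) from the model's text output."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [MLCOMMONS_CATEGORIES.get(c.strip(), c.strip()) for c in codes]

print(parse_verdict("unsafe\nS1,S10"))  # (False, ['Violent Crimes', 'Hate'])
```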
Good For
- Enhancing LLM Safety: Ideal for integrating into systems using Llama 3.1 or other LLMs to safeguard against harmful content in user interactions.
- Multilingual Applications: Suitable for applications requiring content moderation across a diverse set of languages.
- Tool-Augmented LLMs: Particularly useful for securing LLM applications that leverage search or code interpreter tools, preventing misuse or abuse.