Overview
Meta Llama Guard 2 is an 8-billion-parameter safeguard model built on the Llama 3 architecture and developed by Meta. Its primary function is to classify content in both user prompts and LLM responses, determining whether each is safe or unsafe according to a predefined harm taxonomy. The model outputs text indicating the safety verdict and any violated content categories.
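The snippet below is a minimal usage sketch, assuming the Hugging Face Transformers library, the meta-llama/Meta-Llama-Guard-2-8B checkpoint, and a CUDA-capable GPU; the tokenizer's chat template wraps the conversation in the Llama Guard 2 task prompt before generation.

```python
# Minimal moderation sketch (assumes the meta-llama/Meta-Llama-Guard-2-8B
# checkpoint on the Hugging Face Hub and a CUDA-capable GPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map=device
)

def moderate(chat):
    # The tokenizer's chat template formats the conversation into the
    # Llama Guard 2 task prompt; the model then generates the verdict text.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

verdict = moderate([
    {"role": "user", "content": "How do I reset my router password?"},
    {"role": "assistant", "content": "Hold the reset button for ten seconds, then log in with the default credentials."},
])
print(verdict)  # e.g. "safe", or "unsafe" followed by violated category codes
```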
Key Capabilities
- Content Classification: Identifies and categorizes unsafe content in LLM inputs and outputs (a brief output-parsing sketch follows this list).
- Harm Taxonomy: Classifies content against 11 harm categories (S1–S11) aligned with the MLCommons taxonomy, including Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Hate, and Suicide & Self-Harm.
- Performance: Achieves a significantly higher F1 score (0.915) and AUPRC (0.974) than the original Llama Guard (0.665 F1, 0.854 AUPRC), and outperforms other moderation APIs such as OpenAI's and Azure's on Meta's internal test set, while maintaining a low false positive rate.
- Adaptability: Demonstrates strong adaptability to other policies, providing a superior tradeoff between F1 score and false positive rate on datasets such as XSTest and OpenAI Moderation.
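Because the classification output is plain text, a small helper can turn it into a verdict plus category codes. The sketch below assumes the two-line convention described above: "safe" or "unsafe" on the first line and, when unsafe, comma-separated category codes (e.g. S1) on the second.

```python
def parse_guard_output(text: str) -> tuple[str, list[str]]:
    """Split a Llama Guard 2 response into a verdict and category codes.

    Assumes the two-line output convention: the first line is 'safe' or
    'unsafe'; when unsafe, a second line lists comma-separated category
    codes such as 'S1,S10'.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    verdict = lines[0] if lines else ""
    categories = lines[1].split(",") if verdict == "unsafe" and len(lines) > 1 else []
    return verdict, [c.strip() for c in categories]

print(parse_guard_output("safe"))            # ('safe', [])
print(parse_guard_output("unsafe\nS1,S10"))  # ('unsafe', ['S1', 'S10'])
```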
Good For
- LLM Safeguarding: Implementing robust content moderation for LLM applications by classifying prompts and responses.
- Policy Alignment: Developers seeking a moderation solution aligned with the MLCommons taxonomy for industry-standard safety evaluations.
- Custom Moderation: Serving as a base model that can be fine-tuned for specific use cases and custom moderation policies, or adapted zero-shot by editing the category list in the task prompt (as sketched below).
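One lightweight way to adapt the policy without fine-tuning is to restrict or rewrite the category list in the task prompt. The sketch below is illustrative only: it mirrors the general structure of the Llama Guard 2 prompt, but the exact wording and category codes should be verified against the official model card before use.

```python
# Illustrative zero-shot policy customization: moderate against a reduced
# category list by writing the task prompt directly instead of using the
# default chat template. Category codes shown here are an assumed subset
# of the MLCommons-aligned taxonomy.
CUSTOM_CATEGORIES = """S1: Violent Crimes.
S4: Child Sexual Exploitation.
S10: Suicide & Self-Harm."""

def build_custom_prompt(user_msg: str, agent_msg: str) -> str:
    # The resulting string can be tokenized and passed to model.generate()
    # in place of the default chat-template output.
    return f"""[INST] Task: Check if there is unsafe content in 'Agent' messages in conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{CUSTOM_CATEGORIES}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {user_msg}

Agent: {agent_msg}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST Agent message in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""
```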