Llama-Guard-7b: A Llama 2-Based Content Safeguard Model
Llama-Guard-7b is a 7-billion-parameter model built on Llama 2, designed to act as an input-output safeguard for Large Language Models. It classifies content in both user prompts and LLM responses, determining whether each is safe or unsafe according to a predefined policy. Unlike traditional classifiers, Llama-Guard-7b operates as an LLM: it generates a textual verdict stating the safety status and, for unsafe content, the violated subcategories.
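Because the model emits text rather than class logits, applications need to parse its generation: the first output line is `safe` or `unsafe`, and for unsafe content a following line lists the violated category codes (e.g. `O3`). A minimal parsing sketch under that output convention (the helper name is illustrative, and you should adjust if your prompt template changes the format):

```python
def parse_guard_output(generation: str) -> dict:
    """Parse Llama Guard's textual verdict into a structured result.

    Assumes the convention from the released model card: the first
    line is 'safe' or 'unsafe', and unsafe verdicts carry a second
    line of comma-separated category codes such as 'O1,O3'.
    """
    lines = [ln.strip() for ln in generation.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        # Treat an empty or 'safe' generation as a safe classification.
        return {"safe": True, "categories": []}
    categories = []
    if len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return {"safe": False, "categories": categories}
```

For example, `parse_guard_output("unsafe\nO3")` yields `{"safe": False, "categories": ["O3"]}`, giving downstream code a structured record instead of raw text.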
Key Capabilities
- Dual-Direction Moderation: Classifies both incoming user prompts and outgoing LLM responses.
- Policy-Driven Classification: Identifies unsafe content against a comprehensive, open taxonomy of harms, covering Violence & Hate, Sexual Content, Guns & Illegal Weapons, Regulated or Controlled Substances, Suicide & Self-Harm, and Criminal Planning.
- Detailed Harm Identification: Not only flags content as unsafe but also specifies the exact subcategories of violation.
- Adaptable Taxonomy: Released with an open taxonomy and risk guidelines; because the policy is supplied in the prompt, the model can be adapted to different content policies while maintaining high performance.
- Competitive Performance: Performs strongly against industry-standard moderation tools such as the OpenAI Moderation API, Azure AI Content Safety, and Perspective API on benchmarks including the ToxicChat and OpenAI Moderation datasets.
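Since the taxonomy lives in the prompt rather than in the weights, adapting the policy amounts to editing a string. The sketch below assembles a moderation prompt from the six default categories listed above; the exact instruction wording and `[INST]` template come from the official model card, so this simplified version is illustrative only:

```python
# The six default harm categories from the released taxonomy.
CATEGORIES = [
    "Violence & Hate",
    "Sexual Content",
    "Guns & Illegal Weapons",
    "Regulated or Controlled Substances",
    "Suicide & Self-Harm",
    "Criminal Planning",
]

def build_guard_prompt(role: str, conversation: str, categories=CATEGORIES) -> str:
    """Assemble a moderation prompt that embeds the policy in-context.

    `role` is 'User' (prompt classification) or 'Agent' (response
    classification). The wording here is a simplified stand-in for
    the official template shipped with the model.
    """
    policy = "\n".join(f"O{i}: {name}." for i, name in enumerate(categories, 1))
    return (
        f"[INST] Task: Check if there is unsafe content in '{role}' messages "
        "in the conversation below according to our safety policy.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{policy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n"
        f"{conversation}\n"
        "<END CONVERSATION>\n\n"
        f"Provide your safety assessment for '{role}' in the above conversation. "
        "Answer 'safe' or 'unsafe', and if unsafe list the violated categories. "
        "[/INST]"
    )
```

Customizing the policy is then a matter of passing a different `categories` list, which is what makes the taxonomy adaptable without retraining.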
Good for
- Implementing LLM Safety Layers: Ideal for developers looking to integrate robust content moderation directly into their LLM applications.
- Customizing Safety Policies: Useful for organizations that need to adapt content risk guidelines to their specific requirements.
- Researching Content Moderation: Provides a strong baseline for further research and development in automated content safety and harm detection.
- Identifying Specific Harm Types: Excellent for scenarios requiring granular classification of harmful content beyond a simple safe/unsafe flag.