Llama-Guard-7b: A Llama 2-Based Content Safeguard Model
Llama-Guard-7b is a 7-billion-parameter model built on Llama 2, designed to act as an input-output safeguard for Large Language Models. It classifies content in both user prompts and LLM responses, determining whether each is safe or unsafe according to a predefined policy. Unlike traditional classifiers, Llama-Guard-7b operates as an LLM: it generates a textual verdict stating the safety status and, for unsafe content, the violated subcategories.
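Because the model emits text rather than class logits, applications need to parse its generation: the first output line is `safe` or `unsafe`, and for unsafe content a following line lists the violated category codes (e.g. `O3`). A minimal parsing sketch under that output convention (the helper name is illustrative, and you should adjust if your prompt template changes the format):

```python
def parse_guard_output(generation: str) -> dict:
    """Parse Llama Guard's textual verdict into a structured result.

    Assumes the convention from the released model card: the first
    line is 'safe' or 'unsafe', and unsafe verdicts carry a second
    line of comma-separated category codes such as 'O1,O3'.
    """
    lines = [ln.strip() for ln in generation.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        # Treat an empty or 'safe' generation as a safe classification.
        return {"safe": True, "categories": []}
    categories = []
    if len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return {"safe": False, "categories": categories}
```

For example, `parse_guard_output("unsafe\nO3")` yields `{"safe": False, "categories": ["O3"]}`, giving downstream code a structured record instead of raw text.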
Key Capabilities
- Dual-Direction Moderation: Classifies both incoming user prompts and outgoing LLM responses.
- Policy-Driven Classification: Identifies unsafe content against a comprehensive, open taxonomy of harms, covering Violence & Hate, Sexual Content, Guns & Illegal Weapons, Regulated or Controlled Substances, Suicide & Self-Harm, and Criminal Planning.
- Detailed Harm Identification: Not only flags content as unsafe but also specifies the exact subcategories of violation.
- Adaptable Taxonomy: Released with an open taxonomy and risk guidelines; because the policy is supplied in the prompt, the model can be adapted to different content policies while maintaining high performance.
- Competitive Performance: Performs strongly against industry-standard moderation tools such as the OpenAI Moderation API, Azure AI Content Safety, and Perspective API on benchmarks including the ToxicChat and OpenAI Moderation datasets.
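Since the taxonomy lives in the prompt rather than in the weights, adapting the policy amounts to editing a string. The sketch below assembles a moderation prompt from the six default categories listed above; the exact instruction wording and `[INST]` template come from the official model card, so this simplified version is illustrative only:

```python
# The six default harm categories from the released taxonomy.
CATEGORIES = [
    "Violence & Hate",
    "Sexual Content",
    "Guns & Illegal Weapons",
    "Regulated or Controlled Substances",
    "Suicide & Self-Harm",
    "Criminal Planning",
]

def build_guard_prompt(role: str, conversation: str, categories=CATEGORIES) -> str:
    """Assemble a moderation prompt that embeds the policy in-context.

    `role` is 'User' (prompt classification) or 'Agent' (response
    classification). The wording here is a simplified stand-in for
    the official template shipped with the model.
    """
    policy = "\n".join(f"O{i}: {name}." for i, name in enumerate(categories, 1))
    return (
        f"[INST] Task: Check if there is unsafe content in '{role}' messages "
        "in the conversation below according to our safety policy.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{policy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n"
        f"{conversation}\n"
        "<END CONVERSATION>\n\n"
        f"Provide your safety assessment for '{role}' in the above conversation. "
        "Answer 'safe' or 'unsafe', and if unsafe list the violated categories. "
        "[/INST]"
    )
```

Customizing the policy is then a matter of passing a different `categories` list, which is what makes the taxonomy adaptable without retraining.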
Good for
- Implementing LLM Safety Layers: Ideal for developers looking to integrate robust content moderation directly into their LLM applications.
- Customizing Safety Policies: Useful for organizations that need to adapt content risk guidelines to their specific requirements.
- Researching Content Moderation: Provides a strong baseline for further research and development in automated content safety and harm detection.
- Identifying Specific Harm Types: Excellent for scenarios requiring granular classification of harmful content beyond a simple safe/unsafe flag.