ShieldGemma-9b: Specialized Content Moderation Model
ShieldGemma-9b is a 9 billion parameter model from Google, part of the ShieldGemma series, specifically engineered for safety content moderation. Built on the Gemma 2 architecture, this text-to-text, decoder-only LLM is designed to identify and classify content across four critical harm categories: sexually explicit material, dangerous content, hate speech, and harassment.
Key Capabilities
- Targeted Harm Detection: Specializes in identifying content that violates policies related to sexually explicit material, dangerous content, hate speech, and harassment.
- Text-to-Text Classification: Processes input text and outputs a 'Yes' or 'No' classification indicating policy violation.
- Structured Prompting: Utilizes a specific prompt pattern, incorporating a preamble, user/model content, safety policies, and an epilogue, for optimal performance.
- Dual Use Case Support: Provides distinct guidelines for classifying user-provided content (Prompt-only) and combined user-provided/model-generated content (Prompt-Response).
- Open Weights: Available with open weights, facilitating integration and customization within various AI safety frameworks.
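The structured prompt pattern above can be sketched as a small helper that assembles the four parts: preamble, the content to classify, the safety policy, and a closing question. The wording and turn markers below are illustrative stand-ins, not the official template; consult the model card for the exact strings.

```python
# Hypothetical helper illustrating ShieldGemma's prompt structure.
# PREAMBLE/EPILOGUE text here is an assumption for illustration only.

PREAMBLE = ("You are a policy expert trying to help determine whether a "
            "user prompt is in violation of the defined safety policies.")

EPILOGUE = ("Does the human question violate the above principle? "
            "Your answer must start with 'Yes' or 'No'.")

def build_prompt(user_content: str, policy: str) -> str:
    """Assemble a prompt-only (user-content) classification request."""
    return (
        f"{PREAMBLE}\n\n"
        f"<start_of_turn>\n"
        f"Human Question: {user_content.strip()}\n"
        f"<end_of_turn>\n\n"
        f"Our safety principle is defined in the below:\n\n"
        f"* {policy.strip()}\n\n"
        f"{EPILOGUE}"
    )

policy = ('"No Harassment": The prompt shall not contain threatening, '
          'intimidating, or bullying content.')
print(build_prompt("How do I contact my senator?", policy))
```

For the Prompt-Response use case, a second `<start_of_turn>` block containing the model's reply would be appended before the policy section, following the same pattern.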
Performance Highlights
ShieldGemma-9b demonstrates strong performance in content moderation benchmarks. For instance, it achieves 0.828 Optimal F1 / 0.894 AU-PRC on internal 'SG Prompt' datasets and 0.753 Optimal F1 / 0.817 AU-PRC on 'SG Response' datasets. It is also competitive with baselines such as the OpenAI Moderation API and LlamaGuard on external benchmarks, including the OpenAI Moderation dataset and ToxicChat.
Intended Usage
ShieldGemma-9b is primarily intended as a safety content moderator for both human user inputs and AI model outputs. It is a core component of Google's Responsible Generative AI Toolkit, aiming to enhance the safety of AI applications. Developers can integrate it to filter potentially harmful content, ensuring adherence to defined safety principles.
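Rather than parsing generated text, a common way to use a Yes/No classifier like this is to read the probability of a violation from the model's first output token: take the logits for the 'Yes' and 'No' tokens and apply a softmax over just those two. A minimal sketch of that scoring step, with dummy logit values standing in for real model output (the tokenizer/model call itself is omitted):

```python
import math

def violation_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two candidate first tokens ('Yes' vs 'No').

    In a real integration, these logits would come from the model's
    distribution at the first generated position.
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# Made-up logits: a score above a chosen threshold (e.g. 0.5)
# would be treated as a policy violation and filtered.
score = violation_probability(2.0, 0.0)
print(f"P(violation) = {score:.3f}")  # → P(violation) = 0.881
```

The threshold is an application-level choice: a stricter deployment might flag content at a lower probability, trading precision for recall.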
Limitations
ShieldGemma-9b inherits common LLM limitations. It is highly sensitive to how safety principles are phrased and may struggle with ambiguous or nuanced language. Its performance also depends on how well the training and evaluation data represent real-world scenarios.