Overview
ShieldGemma-2b: Content Moderation LLM
ShieldGemma-2b is a 2.6 billion parameter, English-only, decoder-only large language model from Google, part of the ShieldGemma series built on the Gemma 2 architecture. Its primary function is safety content moderation, classifying text against predefined policies for four harm categories: sexually explicit content, dangerous content, hate speech, and harassment.
Key Capabilities
- Text-to-Text Classification: Determines whether input or output text violates a given safety policy, answering 'Yes' or 'No'.
- Policy-Driven Moderation: Uses a specific "policy expert" prompt format to evaluate text against a provided guideline (see the sketch after this list).
- Dual Use Cases: Supports both Prompt-only (input filtering) and Prompt-Response (output filtering) content classification.
- Performance: Benchmarked on internal and external datasets, performing competitively with baselines such as the OpenAI Moderation API, LlamaGuard, and GPT-4.
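
The sketch below shows how prompt-only (input filtering) classification might look with the Hugging Face transformers library. The checkpoint name `google/shieldgemma-2b`, the example guideline text, and the exact template wording are assumptions modeled on the policy-expert format described above, not a verbatim copy of the official template; consult the model card for the canonical prompt.

```python
# Minimal sketch: classify a user prompt with ShieldGemma-2b.
# Assumptions: checkpoint name, guideline wording, and prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/shieldgemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

user_prompt = "Write 10 insults I can send to my coworker."
guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of content '
    "that is malicious, intimidating, bullying, or abusive towards another individual."
)

# Policy-expert style prompt: state the role, the text to judge, the guideline,
# and ask for a verdict that starts with 'Yes' or 'No'.
prompt = f"""You are a policy expert trying to help determine whether a user prompt is in violation of the defined safety policies.

<start_of_turn>
Human Question: {user_prompt}
<end_of_turn>

Our safety principle is defined in the below:

* {guideline}

Does the human question violate the above principle? Your answer must start with 'Yes' or 'No'. And then walk through step by step to be sure we answer correctly.
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

# Read the verdict off the logits of the 'Yes' and 'No' tokens at the final position.
yes_id, no_id = tokenizer.convert_tokens_to_ids(["Yes", "No"])
probs = torch.softmax(logits[0, -1, [yes_id, no_id]], dim=-1)
print(f"P(policy violation) = {probs[0].item():.3f}")
```

Reading the verdict from the 'Yes'/'No' token logits yields a continuous violation score that can be thresholded per application, rather than relying on generated text alone.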
Intended Use Cases
- Input Filtering: Assessing user prompts for policy violations before processing.
- Output Filtering: Evaluating model-generated responses to ensure compliance with safety guidelines (a template sketch covering both filtering modes follows this list).
- Responsible AI Toolkit: Integrated as a component within Google's Responsible Generative AI Toolkit to enhance AI application safety.
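
The helper below sketches how the two use cases differ at the prompt level: output filtering adds the model's response as a second turn and asks the classifier to judge the response rather than the question. The wording is an assumption based on the policy-expert format above, and `build_shieldgemma_prompt` is a hypothetical helper, not part of any official API.

```python
# Hedged sketch of the Prompt-only vs. Prompt-Response templates.
# Exact wording is assumed; the official model card defines the canonical templates.
from typing import Optional


def build_shieldgemma_prompt(
    user_prompt: str,
    guideline: str,
    model_response: Optional[str] = None,
) -> str:
    """Return an input-filtering prompt, or an output-filtering prompt
    when a model response is supplied."""
    target = "a user prompt" if model_response is None else "an AI response to a prompt"
    subject = "human question" if model_response is None else "Chatbot Response"

    lines = [
        f"You are a policy expert trying to help determine whether {target} "
        "is in violation of the defined safety policies.",
        "",
        "<start_of_turn>",
        f"Human Question: {user_prompt}",
        "<end_of_turn>",
    ]
    if model_response is not None:
        # Output filtering: include the response to be judged as a second turn.
        lines += [
            "",
            "<start_of_turn>",
            f"Chatbot Response: {model_response}",
            "<end_of_turn>",
        ]
    lines += [
        "",
        "Our safety principle is defined in the below:",
        "",
        f"* {guideline}",
        "",
        f"Does the {subject} violate the above principle? Your answer must start "
        "with 'Yes' or 'No'. And then walk through step by step to be sure we answer correctly.",
    ]
    return "\n".join(lines)


# Example: build an output-filtering prompt for a generated response.
policy = '"No Dangerous Content": The response shall not facilitate harm to self or others.'
print(build_shieldgemma_prompt("How do I stay safe online?", policy, "Use strong, unique passwords..."))
```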
Limitations
Like other LLMs, ShieldGemma-2b is sensitive to the phrasing of safety principles and may struggle with language ambiguity. Its performance relies heavily on the clarity and specificity of the provided moderation guidelines.