google/shieldgemma-2b

2.6B parameters · BF16 · 8192-token context · License: gemma · Gated model on Hugging Face
Overview

ShieldGemma-2b: Content Moderation LLM

ShieldGemma-2b is a 2.6-billion-parameter, English-only, decoder-only large language model from Google, part of the ShieldGemma series built on the Gemma 2 architecture. Its primary function is safety content moderation: it classifies text against a provided policy covering four harm categories: sexually explicit content, dangerous content, hate speech, and harassment.

Key Capabilities

  • Text-to-Text Classification: Determines if input or output text violates safety policies, returning 'Yes' or 'No'.
  • Policy-Driven Moderation: Utilizes a specific prompt format, acting as a "policy expert" to evaluate text based on provided guidelines.
  • Dual Use Cases: Supports both Prompt-only (input filtering) and Prompt-Response (output filtering) content classification.
  • Performance: Benchmarked on internal and external datasets, showing competitive moderation performance against baselines such as the OpenAI Moderation API, LlamaGuard, and GPT-4.
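The "policy expert" prompt format described above can be sketched as follows. The template layout (policy-expert framing, turn markers, the Yes/No instruction) is paraphrased from the format documented for ShieldGemma; the guideline text in the example is an illustrative assumption, not Google's official policy wording.

```python
def build_shieldgemma_prompt(user_content: str, guideline: str) -> str:
    """Assemble a ShieldGemma-style prompt-classification input.

    The caller supplies the safety guideline text; the surrounding
    template follows the documented ShieldGemma layout (paraphrased).
    """
    return (
        "You are a policy expert trying to help determine whether a user "
        "prompt is in violation of the defined safety policies.\n\n"
        "<start_of_turn>\n"
        f"Human Question: {user_content.strip()}\n"
        "<end_of_turn>\n\n"
        "Our safety principle is defined in the below:\n\n"
        f"* {guideline}\n\n"
        "===\n\n"
        "Does the human question violate the above principle? Your answer "
        "must start with 'Yes' or 'No'. And then walk through step by step "
        "to be sure we answer correctly.\n"
    )


# Illustrative guideline wording (an assumption, not the official text).
guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of '
    "content that is malicious, intimidating, bullying, or abusive."
)
prompt = build_shieldgemma_prompt("How do I tie my shoes?", guideline)
```

The assembled string is then tokenized and passed to the model like any other causal-LM input; only the first generated token ('Yes' or 'No') matters for classification.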

Intended Use Cases

  • Input Filtering: Assessing user prompts for policy violations before processing.
  • Output Filtering: Evaluating model-generated responses to ensure compliance with safety guidelines.
  • Responsible AI Toolkit: Integrated as a component within Google's Responsible Generative AI Toolkit to enhance AI application safety.
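In both filtering modes, ShieldGemma's documentation derives a violation score from the probabilities of the 'Yes' and 'No' tokens at the first generated position rather than from sampled text. A minimal sketch of that scoring step, with model inference stubbed out (the logits here would come from a framework such as transformers; only standard-library math is used):

```python
import math

def violation_score(yes_logit: float, no_logit: float) -> float:
    """Softmax over the 'Yes' and 'No' first-token logits.

    In real use these logits come from the model's distribution at the
    last input position, e.g. (assumed transformers usage):
        logits = model(**inputs).logits[0, -1]
        yes_logit, no_logit = logits[vocab["Yes"]], logits[vocab["No"]]
    Returns P('Yes'), i.e. the probability the text violates the policy.
    """
    m = max(yes_logit, no_logit)   # subtract max to stabilize the exps
    yes = math.exp(yes_logit - m)
    no = math.exp(no_logit - m)
    return yes / (yes + no)


# A confident 'No' (large logit gap in favour of 'No') yields a score
# near 0; the caller compares the score to a chosen threshold.
score = violation_score(yes_logit=-1.2, no_logit=2.8)
```

Thresholding this score (rather than parsing generated text) lets an application trade off precision against recall per policy.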

Limitations

Like other LLMs, ShieldGemma-2b is sensitive to the phrasing of safety principles and may struggle with language ambiguity. Its performance relies heavily on the clarity and specificity of the provided moderation guidelines.