Model Overview

Granite Guardian 3.0 2B, developed by IBM Research, is a 2 billion parameter model fine-tuned from Granite 3.0 Instruct. Its primary function is to act as a guardrail, detecting a wide array of risks within both user prompts and model-generated responses. The model is trained on a unique dataset combining human annotations and synthetic data, informed by internal red-teaming efforts, enabling it to surpass other open-source models in its class on relevant benchmarks.

Key Capabilities

Comprehensive Risk Detection: Identifies risks such as harm, social bias, jailbreaking, violence, profanity, sexual content, and unethical behavior.
RAG Hallucination Assessment: Evaluates context relevance, groundedness (factual accuracy against context), and answer relevance in Retrieval-Augmented Generation (RAG) pipelines.
Benchmark Performance: Achieves an aggregate F1 score of 0.67 across various harm benchmarks and an average AUC of 0.81 on RAG hallucination benchmarks like TRUE.
Custom Risk Definitions: Applicable for use with custom risk definitions, though these require testing.

Intended Use Cases

Prompt and Response Guardrails: Detects harm-related risks in user inputs and AI outputs.
RAG Pipeline Quality Control: Ensures retrieved context is relevant, responses are grounded in facts, and answers directly address user queries.
Model Risk Assessment & Monitoring: Suitable for use cases requiring moderate cost, latency, and throughput, such as observability and spot-checking.

Limitations

Strictly designed for a prescribed scoring mode (yes/no outputs) based on a specified template; deviations may lead to unexpected or unsafe outputs.
Currently trained and tested exclusively on English data.

Overview

Model Overview

Key Capabilities

Intended Use Cases

Limitations

Full Model Card (README)