Model Overview

Granite Guardian 3.1 8B is an 8-billion-parameter model developed by IBM Research and fine-tuned from the Granite 3.1 Instruct model for comprehensive risk detection in AI interactions. It identifies potential harms in user prompts and model responses and detects several forms of hallucination in advanced AI workflows. It is trained on a unique dataset that combines human annotations with synthetic data informed by internal red-teaming, which enables it to outperform other open-source models in its class on standard benchmarks.
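The summary does not describe the model's output format. One common way to consume a binary risk detector is to turn its verdict into a score. The sketch below assumes the model answers "Yes" when a risk is present and "No" otherwise, and that your inference stack exposes the log-probabilities of those two tokens; `risk_probability` and `verdict` are hypothetical helper names, not part of any published API.

```python
import math


def risk_probability(logprob_yes: float, logprob_no: float) -> float:
    """Estimate P(risk) from the log-probabilities of the two verdict tokens.

    Assumes "Yes" means the risk is present. The two-way softmax renormalises
    over just these tokens, ignoring any mass on other tokens.
    """
    m = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)


def verdict(logprob_yes: float, logprob_no: float, threshold: float = 0.5) -> bool:
    """Flag the input when the estimated risk probability reaches `threshold`."""
    return risk_probability(logprob_yes, logprob_no) >= threshold


if __name__ == "__main__":
    # Made-up log-probabilities for one evaluated prompt.
    lp_yes, lp_no = math.log(0.9), math.log(0.05)
    print(f"P(risk) = {risk_probability(lp_yes, lp_no):.3f}")  # 0.9 / 0.95 -> 0.947
    print("flagged:", verdict(lp_yes, lp_no))
```

Renormalising over only the two verdict tokens yields a continuous score, which is what AUC-style evaluation needs, while `threshold` lets a deployment trade recall against false positives.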

Key Capabilities

Harm Detection: Identifies a broad spectrum of harmful content, including social bias, jailbreaking attempts, violence, profanity, sexual content, and unethical behavior, using risk categories aligned with the IBM AI Risk Atlas.
RAG Hallucination Detection: Assesses critical issues in Retrieval-Augmented Generation (RAG) pipelines, such as context relevance, groundedness (faithfulness to context), and answer relevance.
Function Calling Hallucination Detection: Evaluates intermediate steps in agentic workflows for syntactic and semantic hallucinations in function calls, flagging invalid calls and fabricated information.
Benchmark Performance: Achieves strong F1 scores across various harm benchmarks (e.g., 0.88 on AegisSafetyTest, 0.80 on HarmBench) and high AUC scores for RAG hallucination (0.86 average on TRUE benchmarks) and function calling hallucination (e.g., 0.92 on DeepSeek).
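Each capability above evaluates a different slice of the conversation: harm checks look at a prompt (optionally with a response), the three RAG checks each pair two of query, context, and response, and the function-calling check needs the available tools, the query, and the proposed call. The sketch below encodes that mapping in a hypothetical `build_request` helper. The risk identifiers, the `context` and `tools` roles, and the `guardian_config` key are assumptions for illustration; check the full model card for the exact names its prompt template expects.

```python
from typing import Optional

# Assumed risk identifiers for the harm categories listed above.
HARM_RISKS = {"harm", "social_bias", "jailbreak", "violence", "profanity",
              "sexual_content", "unethical_behavior"}

# Assumed inputs required by each hallucination check.
REQUIRED_FIELDS = {
    "context_relevance": ("user", "context"),         # is the retrieved context on-topic?
    "groundedness": ("context", "assistant"),         # is the answer faithful to the context?
    "answer_relevance": ("user", "assistant"),        # does the answer address the query?
    "function_call": ("tools", "user", "assistant"),  # is the proposed call valid?
}

ROLE_ORDER = ("tools", "context", "user", "assistant")


def build_request(risk: str, *, user: Optional[str] = None,
                  assistant: Optional[str] = None,
                  context: Optional[str] = None,
                  tools: Optional[list] = None) -> dict:
    """Assemble a hypothetical evaluator request for a single risk check."""
    fields = {"tools": tools, "context": context, "user": user, "assistant": assistant}
    if risk in HARM_RISKS:
        # Harm checks screen a prompt, optionally together with a response.
        needed = ("user",)
    elif risk in REQUIRED_FIELDS:
        needed = REQUIRED_FIELDS[risk]
    else:
        raise ValueError(f"unknown risk: {risk!r}")
    missing = [name for name in needed if fields[name] is None]
    if missing:
        raise ValueError(f"{risk} check needs: {', '.join(missing)}")
    # Include every supplied field, in a fixed role order.
    messages = [{"role": role, "content": fields[role]}
                for role in ROLE_ORDER if fields[role] is not None]
    return {"messages": messages, "guardian_config": {"risk_name": risk}}


if __name__ == "__main__":
    req = build_request("groundedness",
                        context="The Eiffel Tower is in Paris.",
                        assistant="The Eiffel Tower is in Rome.")
    print(req)
```

Validating the inputs up front catches a common integration bug, such as running a groundedness check without the retrieved context, before any evaluator call is spent.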

Good For

AI Safety and Governance: Implementing guardrails for enterprise applications by detecting risks in user inputs and model outputs.
RAG System Integrity: Ensuring the reliability and accuracy of responses in RAG systems by identifying irrelevant context, ungrounded claims, or off-topic answers.
Agentic Workflow Reliability: Validating function calls and detecting hallucinations in complex AI agent interactions.
Model Risk Assessment & Monitoring: Suited to use cases where moderate cost, latency, and throughput are acceptable, such as assessing and monitoring AI model risks and spot-checking inputs and outputs.

The model is primarily trained and tested on English data.
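The guardrail and monitoring use cases differ mainly in where the evaluator sits. The sketch below shows both patterns under stated assumptions: `scorer` is a hypothetical callable returning the probability that a named risk is present (for example, backed by the model), `generate` stands in for the application model, and `should_spot_check` samples a fixed fraction of traffic by hashing request IDs, so the same request is always either in or out of the sample.

```python
import hashlib
from typing import Callable, Optional

# Hypothetical scorer signature: (risk_name, user_text, assistant_text or None)
# -> probability that the named risk is present.
Scorer = Callable[[str, str, Optional[str]], float]


def should_spot_check(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for offline review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # 48 bits mapped to [0, 1); the same ID always lands in the same bucket.
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < rate


def guarded_reply(user: str, generate: Callable[[str], str], scorer: Scorer,
                  threshold: float = 0.5) -> str:
    """Inline guardrail: screen the prompt, generate, then screen the response."""
    if scorer("harm", user, None) >= threshold:
        return "[blocked: prompt flagged]"
    reply = generate(user)
    if scorer("harm", user, reply) >= threshold:
        return "[blocked: response flagged]"
    return reply


if __name__ == "__main__":
    # Toy stand-in for the evaluator: flags any text containing "attack".
    def toy_scorer(risk: str, user: str, assistant: Optional[str]) -> float:
        text = assistant if assistant is not None else user
        return 0.9 if "attack" in text.lower() else 0.1

    print(guarded_reply("How do I bake bread?", lambda u: "Mix flour and water.", toy_scorer))
    print(guarded_reply("Plan an attack", lambda u: "...", toy_scorer))
    sampled = sum(should_spot_check(f"req-{i}", 0.1) for i in range(10_000))
    print(f"spot-checked {sampled} of 10000 requests")
```

An inline guardrail adds one or two evaluator calls to every request, which suits the enterprise guardrail case; hash-based spot-checking bounds that cost to the sampled fraction, which fits the moderate cost and latency profile described for risk monitoring.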
