Overview

Granite Guardian 3.2 3B-A800M is a 3 billion parameter instruct model from IBM Research, specifically fine-tuned for comprehensive risk detection in AI interactions. It is trained on a unique dataset combining human annotations and synthetic data, informed by internal red-teaming efforts, enabling it to surpass other open-source models in its category on key benchmarks.

Key Capabilities

Harm Detection: Identifies a broad spectrum of harmful content, including social bias, jailbreaking attempts, violence, profanity, sexual content, unethical behavior, harm engagement, and evasiveness.
RAG Hallucination Detection: Assesses context relevance, groundedness (faithfulness to context), and answer relevance in retrieval-augmented generation (RAG) pipelines.
Function Calling Hallucination Detection: Evaluates agentic workflows for syntactic and semantic errors in function calls, detecting fabricated information during query translation.
Configurable Risk Definitions: Supports detection for predefined risks and can be adapted for custom risk definitions, though these require additional testing.

Training and Evaluation

The model was trained on a combination of human-annotated data from datasets like hh-rlhf and synthetic data to enhance performance across conversational, hallucination, and jailbreak-related risks. It demonstrates strong performance across various harm benchmarks (e.g., Aegis AI Content Safety Dataset, ToxicChat, HarmBench) with an aggregate F1 score of 0.74. For RAG hallucination, it achieves an average AUC of 0.77 on TRUE benchmarks, and for function calling hallucination, an average AUC of 0.70 across multiple datasets including APIGen and ToolAce.

Intended Use

Granite Guardian is designed for risk detection in enterprise applications, serving as guardrails for prompts, responses, and conversations. It is suitable for model risk assessment, observability, monitoring, and spot-checking inputs/outputs, particularly where moderate cost, latency, and throughput are acceptable. The model is currently trained and tested exclusively on English data.

Overview

Overview

Key Capabilities

Training and Evaluation

Intended Use

Full Model Card (README)