ibm-granite/granite-guardian-3.2-3b-a800m

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 3B | Quant: BF16 | Ctx Length: 32k | Published: Feb 3, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

Granite Guardian 3.2 3B-A800M is a 3-billion-parameter instruct model developed by IBM Research and fine-tuned to detect risks in prompts and responses. It identifies several categories of harm, hallucination risks in RAG pipelines, and function-calling hallucinations in agentic workflows. Trained on a mix of human-annotated and synthetic data, it outperforms other open-source models in its class on standard benchmarks for content safety and hallucination detection. The model is designed for enterprise applications that need AI safety guardrails and risk assessment.


Overview

Granite Guardian 3.2 3B-A800M is a 3-billion-parameter instruct model from IBM Research, fine-tuned to detect risks in AI interactions. Its training data combines human annotations with synthetic data and is informed by internal red-teaming, which enables it to surpass other open-source models in its category on key benchmarks.

Key Capabilities

  • Harm Detection: Identifies a broad spectrum of harmful content, including social bias, jailbreaking attempts, violence, profanity, sexual content, unethical behavior, harm engagement, and evasiveness.
  • RAG Hallucination Detection: Assesses context relevance, groundedness (faithfulness to context), and answer relevance in retrieval-augmented generation (RAG) pipelines.
  • Function Calling Hallucination Detection: Evaluates agentic workflows for syntactic and semantic errors in function calls, detecting fabricated information during query translation.
  • Configurable Risk Definitions: Supports detection for predefined risks and can be adapted for custom risk definitions, though these require additional testing.
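The hallucination checks listed above differ mainly in which parts of an exchange they compare. The sketch below makes that explicit. The check labels and part names are illustrative assumptions, not identifiers confirmed by this page, and the inputs assumed for the function-calling check (user query, tool definitions, assistant call) are a guess at what such a check needs.

```python
# Which parts of an exchange each hallucination check compares.
# Labels and part names are illustrative assumptions, not confirmed API identifiers.
REQUIRED_INPUTS = {
    "context_relevance": {"user", "context"},         # is the retrieved context relevant to the query?
    "groundedness": {"context", "assistant"},         # is the answer faithful to the context?
    "answer_relevance": {"user", "assistant"},        # does the answer address the query?
    "function_call": {"user", "tools", "assistant"},  # assumed inputs for function-call checks
}


def runnable_checks(exchange):
    """Return, in sorted order, the checks whose required parts are all present and non-empty."""
    available = {part for part, value in exchange.items() if value}
    return sorted(
        name for name, needed in REQUIRED_INPUTS.items() if needed <= available
    )


# A RAG exchange without tool definitions supports the three RAG checks only.
exchange = {
    "user": "What is the refund window?",
    "context": "Refunds are accepted within 30 days of purchase.",
    "assistant": "You can request a refund within 30 days.",
}
print(runnable_checks(exchange))
```

A table like this can help a pipeline decide which checks are worth running for a given exchange before any call to the guardian model is made.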

Training and Evaluation

The model was trained on human-annotated data drawn from datasets such as hh-rlhf, combined with synthetic data, to improve performance on conversational, hallucination, and jailbreak-related risks. On harm benchmarks (e.g., Aegis AI Content Safety Dataset, ToxicChat, HarmBench) it reaches an aggregate F1 score of 0.74. For RAG hallucination it achieves an average AUC of 0.77 on the TRUE benchmarks, and for function-calling hallucination an average AUC of 0.70 across multiple datasets, including APIGen and ToolAce.
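For readers less familiar with the two metrics quoted above, here is a minimal sketch of how F1 and AUC are computed. The helper names are hypothetical and the numbers are toy values chosen for illustration; they are not taken from the model's evaluation.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 8 true positives, 2 false positives, 2 false negatives.
print(f1_from_counts(tp=8, fp=2, fn=2))

# Toy risk scores: label 1 means risky, label 0 means safe.
print(pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

F1 depends on a fixed decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two kinds of benchmark figures are not directly comparable.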

Intended Use

Granite Guardian is designed for risk detection in enterprise applications, serving as guardrails for prompts, responses, and conversations. It is suitable for model risk assessment, observability, monitoring, and spot-checking inputs/outputs, particularly where moderate cost, latency, and throughput are acceptable. The model is currently trained and tested exclusively on English data.
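The guardrail pattern described above, checking the prompt before generation and the response after it, can be sketched as follows. This is a minimal illustration under stated assumptions: `guarded_reply`, `generate`, and `is_risky` are hypothetical names, and the toy stand-ins below take the place of a real assistant model and a real call to the guardian model.

```python
from typing import Callable

BLOCKED = "Request blocked by guardrail."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_risky: Callable[[str, str], bool],
) -> str:
    """Screen the prompt, generate a reply, then screen the reply.

    `is_risky(role, text)` is a hypothetical hook; in practice it would
    wrap a call to a risk-detection model and return True when a risk is flagged.
    """
    if is_risky("user", prompt):
        return BLOCKED
    reply = generate(prompt)
    if is_risky("assistant", reply):
        return BLOCKED
    return reply


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_is_risky(role: str, text: str) -> bool:
    return "attack" in text.lower()


print(guarded_reply("hello", toy_generate, toy_is_risky))
print(guarded_reply("plan an attack", toy_generate, toy_is_risky))
```

Because each exchange incurs up to two extra model calls in this layout, it fits the moderate cost and latency budget mentioned above better than strictly latency-critical paths.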