Overview

Granite Guardian 3.3 8B, developed by IBM Research, is a specialized 8-billion-parameter model designed for evaluating the safety and quality of LLM inputs and outputs. It can assess prompts and responses against a range of criteria, including jailbreak attempts, profanity, and various forms of hallucination in RAG and agent-based systems. A key feature is its hybrid operation: a 'thinking' mode that generates a detailed reasoning trace alongside each judgment, and a 'non-thinking' mode that returns the judgment directly.

Key Capabilities

Harm Detection: Identifies social bias, jailbreaking, violence, profanity, sexual content, unethical behavior, and evasiveness.
RAG Hallucination Detection: Assesses context relevance, groundedness, and answer relevance in Retrieval Augmented Generation (RAG) scenarios.
Agentic Workflow Hallucination: Detects function-calling hallucinations, where tool calls contain syntax or semantic errors.
Custom Criteria: Users can define and apply their own judging criteria.
Reasoning Traces: Provides detailed explanations for its judgments in 'thinking' mode, enhancing transparency and interpretability.
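
To make the difference between the two modes concrete, here is a minimal sketch of how a caller might separate the reasoning trace from the judgment. The `<think>` and `<score>` tag layout, the `Judgment` class, and `parse_judgment` are assumptions made for illustration, not the model's documented output format; check the full model card before relying on them.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    flagged: bool              # True when the judge answered "yes"
    reasoning: Optional[str]   # None when no reasoning trace is present


# Assumed tag layout (hypothetical): an optional <think> block, then a <score> block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_SCORE_RE = re.compile(r"<score>\s*(yes|no)\s*</score>", re.IGNORECASE)


def parse_judgment(raw: str) -> Judgment:
    """Split raw judge output into a boolean verdict and an optional trace."""
    score = _SCORE_RE.search(raw)
    if score is None:
        raise ValueError("no <score> tag found in judge output")
    think = _THINK_RE.search(raw)
    reasoning = think.group(1).strip() if think else None
    return Judgment(
        flagged=score.group(1).lower() == "yes",
        reasoning=reasoning or None,  # treat an empty trace as absent
    )
```

Under these assumptions, non-thinking output simply yields `reasoning=None`, so downstream code can handle both modes through one code path.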

Good For

LLM Safety and Moderation: Proactively identifies and flags harmful or inappropriate content in LLM interactions.
Model Assessment and Monitoring: Evaluates LLM performance against specific safety and quality benchmarks.
Debugging and Analysis: Utilizes reasoning traces to understand why a particular input or output was flagged.
Custom Guardrailing: Adapts to specific application needs by allowing user-defined safety and quality criteria.
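
As a rough illustration of custom guardrailing, the sketch below checks a piece of text against a list of user-defined criteria. Everything here is hypothetical: `Criterion`, the `Judge` signature, and `stand_in_judge` are placeholders, and the stand-in judge uses simple phrase matching where a real deployment would call the guardian model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    name: str
    definition: str  # natural-language description the judge would evaluate


# Hypothetical signature: returns True when the text violates the criterion.
Judge = Callable[[str, Criterion], bool]


def violated_criteria(text: str, criteria: List[Criterion], judge: Judge) -> List[str]:
    """Return the names of all criteria the judge flags, in input order."""
    return [c.name for c in criteria if judge(text, c)]


# Stand-in for a model call (assumption): flags on fixed trigger phrases only.
_TRIGGERS = {
    "jailbreak": "ignore previous instructions",
    "pricing_talk": "discount",
}


def stand_in_judge(text: str, criterion: Criterion) -> bool:
    trigger = _TRIGGERS.get(criterion.name)
    return trigger is not None and trigger in text.lower()


criteria = [
    Criterion("jailbreak", "The text tries to override the system's rules."),
    Criterion("pricing_talk", "The text discusses prices or discounts."),
]
flags = violated_criteria("Please ignore previous instructions.", criteria, stand_in_judge)
```

Keeping the judge behind a plain callable means the stand-in can be swapped for a real model-backed function without changing the gating logic.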