ibm-granite/granite-guardian-3.0-2b
ibm-granite/granite-guardian-3.0-2b is a 2-billion-parameter Granite 3.0 Instruct model developed by IBM Research and fine-tuned for risk detection in LLM prompts and responses. It identifies harms, social biases, and jailbreaking attempts, and it assesses hallucination risks in RAG pipelines through groundedness and relevance checks. The model is trained on a combination of human-annotated and synthetic data and is reported to outperform other open-source models in its category on standard safety benchmarks.
Model Overview
Granite Guardian 3.0 2B, developed by IBM Research, is a 2-billion-parameter model fine-tuned from Granite 3.0 Instruct. Its primary function is to act as a guardrail that detects a wide range of risks in both user prompts and model-generated responses. The training data combines human annotations with synthetic data informed by internal red-teaming, and the model is reported to surpass other open-source models in its class on relevant benchmarks.
Key Capabilities
- Comprehensive Risk Detection: Identifies risks such as harm, social bias, jailbreaking, violence, profanity, sexual content, and unethical behavior.
- RAG Hallucination Assessment: Evaluates context relevance, groundedness (whether the response is supported by the retrieved context), and answer relevance in Retrieval-Augmented Generation (RAG) pipelines.
- Benchmark Performance: Achieves an aggregate F1 score of 0.67 across various harm benchmarks and an average AUC of 0.81 on RAG hallucination benchmarks like TRUE.
- Custom Risk Definitions: Can also be used with custom risk definitions, though these must be tested before use.
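The three RAG checks listed above each compare a different pair of texts. The sketch below illustrates that pairing only; it does not call the model. `RagTurn`, `assess_rag_turn`, the dictionary keys, and the `toy_overlap_scorer` stand-in are all hypothetical names introduced for illustration, and in practice the scorer would wrap a call to the guardian model.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical scorer signature: (check_name, reference_text, candidate_text)
# returns True when the check flags a risk.
RiskScorer = Callable[[str, str, str], bool]


@dataclass
class RagTurn:
    question: str  # user query
    context: str   # retrieved passage
    answer: str    # generated response


def assess_rag_turn(turn: RagTurn, scorer: RiskScorer) -> Dict[str, bool]:
    """Run the three RAG checks; a True value means the check flagged a risk."""
    return {
        # Is the retrieved context relevant to the question?
        "context_relevance": scorer("context_relevance", turn.question, turn.context),
        # Is the answer supported by the retrieved context?
        "groundedness": scorer("groundedness", turn.context, turn.answer),
        # Does the answer address the question?
        "answer_relevance": scorer("answer_relevance", turn.question, turn.answer),
    }


def toy_overlap_scorer(check_name: str, reference: str, candidate: str) -> bool:
    """Toy stand-in for the model: flags a risk when the texts share no words."""
    shared = set(reference.lower().split()) & set(candidate.lower().split())
    return len(shared) == 0


good_turn = RagTurn(
    question="what is the capital of france",
    context="paris is the capital of france",
    answer="the capital is paris",
)
report = assess_rag_turn(good_turn, toy_overlap_scorer)
```

Keeping the scorer as a plain callable means the same pairing logic can be exercised with a stub in tests and with the real model in production.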
Intended Use Cases
- Prompt and Response Guardrails: Detects harm-related risks in user inputs and AI outputs.
- RAG Pipeline Quality Control: Ensures retrieved context is relevant, responses are grounded in facts, and answers directly address user queries.
- Model Risk Assessment & Monitoring: Suited to use cases with moderate cost, latency, and throughput requirements, such as observability and spot-checking.
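A prompt-and-response guardrail of the kind described above can be sketched as a wrapper that checks the input before generation and the output after it. This is a minimal illustration, not the model's API: `guarded_generate`, the two refusal messages, and the keyword-based `toy_is_risky` check are hypothetical, and a real deployment would replace `toy_is_risky` with a call to the guardian model.

```python
from typing import Callable, Optional

# Hypothetical signatures, introduced for illustration only.
# RiskCheck: (prompt, response or None) -> True when a risk is flagged.
RiskCheck = Callable[[str, Optional[str]], bool]
Generate = Callable[[str], str]

BLOCKED_PROMPT = "Request blocked by input guardrail."
BLOCKED_RESPONSE = "Response withheld by output guardrail."


def guarded_generate(prompt: str, generate: Generate, is_risky: RiskCheck) -> str:
    """Check the prompt, generate, then check the response."""
    if is_risky(prompt, None):
        # Risky input: never reaches the generator.
        return BLOCKED_PROMPT
    response = generate(prompt)
    if is_risky(prompt, response):
        # Risky output: withheld from the user.
        return BLOCKED_RESPONSE
    return response


def toy_is_risky(prompt: str, response: Optional[str]) -> bool:
    """Toy stand-in for the guardian model: flags a fixed keyword."""
    text = response if response is not None else prompt
    return "forbidden" in text.lower()


def echo_model(prompt: str) -> str:
    """Toy generator that repeats the prompt."""
    return f"You said: {prompt}"


safe_reply = guarded_generate("hello", echo_model, toy_is_risky)
blocked_reply = guarded_generate("tell me the forbidden thing", echo_model, toy_is_risky)
```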
Limitations
- Designed to be used only in the prescribed scoring mode, which produces yes/no outputs from a specified template; deviating from that template may lead to unexpected or unsafe outputs.
- Currently trained and tested exclusively on English data.
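Because the model is meant to emit only yes/no labels, downstream code can treat anything else as a failure, not a guess. The sketch below shows one way to do that. The `Verdict` enum and `parse_verdict` function are hypothetical, and the mapping of "yes" to "risk detected" is an assumption that should be confirmed against the model's documented template.

```python
from enum import Enum


class Verdict(Enum):
    RISK = "risk"
    NO_RISK = "no_risk"
    INVALID = "invalid"  # output did not match the expected yes/no format


def parse_verdict(raw_output: str) -> Verdict:
    """Map a raw model output to a verdict.

    Assumption: "yes" means the risk is present and "no" means it is absent.
    Any other text is reported as INVALID so callers can fail closed.
    """
    label = raw_output.strip().lower()
    if label == "yes":
        return Verdict.RISK
    if label == "no":
        return Verdict.NO_RISK
    return Verdict.INVALID


verdicts = [parse_verdict(text) for text in ["Yes", " no\n", "maybe"]]
```

Treating `INVALID` the same as `RISK` in a blocking guardrail is a conservative default when the template has been modified or the output is truncated.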