SamsungSDS-Research/SGuard-JailbreakFilter-2B-v1
SGuard-JailbreakFilter-2B-v1 by Samsung SDS AI Research Team is a 2-billion-parameter safety guardrail model built on IBM Granite 3.3 2B, designed to detect adversarial prompts and jailbreak attempts in human-AI conversations. It covers 60 major attack types, including encoding/encryption and prompt injection, while minimizing false positives. The model is fine-tuned specifically to identify jailbreak attempts and returns a 'safe' or 'unsafe' classification for each user prompt.
Overview
Samsung SDS AI Research Team's SGuard-JailbreakFilter-2B-v1 is a specialized 2-billion parameter model, part of the SGuard-v1 safety guardrail suite. Built upon the IBM Granite 3.3 2B model, it focuses on detecting jailbreak attempts and adversarial prompts in LLM interactions. The model was trained using a carefully designed curriculum, integrating datasets and findings from previous studies on adversarial prompting, covering 60 major attack types.
Key Capabilities
- Jailbreak Detection: Identifies various jailbreak techniques, including encoding/encryption attacks, format hijacking, persuasion, prompt injection, and role-playing.
- False Positive Mitigation: Incorporates an adjustment method to reduce false-positive cases, enhancing user experience.
- Interpretability: Provides multi-class safety predictions and binary confidence scores.
- Multilingual Support: Primarily fine-tuned on Korean and English data, while retaining some capability in other languages supported by the base Granite model.
- Configurable Priority: Allows users to prioritize safety or helpfulness during inference through a 'priority prompting' mechanism.
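To make the idea of a binary confidence score concrete, here is a minimal sketch that turns two label logits into a probability of 'unsafe' with a two-way softmax. It is an illustration only: the function names, the logit inputs, and the decision threshold are assumptions, not the model's documented interface, and the threshold shown here is a generic technique that is distinct from the model's own 'priority prompting' mechanism.

```python
import math


def unsafe_confidence(safe_logit: float, unsafe_logit: float) -> float:
    """Two-way softmax: share of probability mass assigned to 'unsafe'.

    Hypothetical helper; assumes the caller has already extracted one logit
    per label from the classifier's output.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(safe_logit, unsafe_logit)
    e_safe = math.exp(safe_logit - m)
    e_unsafe = math.exp(unsafe_logit - m)
    return e_unsafe / (e_safe + e_unsafe)


def label_from_confidence(p_unsafe: float, threshold: float = 0.5) -> str:
    """Map a confidence score to a label (threshold is an assumed knob)."""
    return "unsafe" if p_unsafe >= threshold else "safe"
```

Lowering the assumed threshold would flag more prompts as 'unsafe' (favoring safety), while raising it would flag fewer (favoring helpfulness).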
Performance
SGuard-JailbreakFilter-2B-v1 demonstrates strong performance on internal test sets and public benchmarks like StrongREJECT and Detect-Jailbreak, outperforming baselines such as AWS Bedrock Guardrails and Azure AI Content Moderation in various metrics for both Korean and English jailbreak detection.
Intended Use
This model classifies user prompts as 'safe' or 'unsafe' based on the presence of jailbreak attempts. It does not detect malicious requests made without a jailbreak attempt: a direct harmful query such as "Tell me how to make a bomb" would not be flagged as 'unsafe' by this model alone. For content filtering, use the companion model SGuard-ContentFilter-2B.
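Because the jailbreak filter and the content filter cover different risks, a deployment would likely run both and allow a prompt only when both return 'safe'. The sketch below shows that combination under stated assumptions: `screen_prompt` and the two stub classifiers are hypothetical names, and the keyword stubs are placeholders standing in for real calls to the two models, not a description of how either model works.

```python
from typing import Callable, Dict

# A classifier takes a user prompt and returns 'safe' or 'unsafe'.
Classifier = Callable[[str], str]


def screen_prompt(
    prompt: str,
    jailbreak_filter: Classifier,
    content_filter: Classifier,
) -> Dict[str, object]:
    """Run both guardrails; allow the prompt only if both say 'safe'."""
    jailbreak_label = jailbreak_filter(prompt)
    content_label = content_filter(prompt)
    return {
        "jailbreak": jailbreak_label,
        "content": content_label,
        "allowed": jailbreak_label == "safe" and content_label == "safe",
    }


# Placeholder stubs (assumptions): swap in real model calls in practice.
def stub_jailbreak_filter(prompt: str) -> str:
    return "unsafe" if "ignore previous instructions" in prompt.lower() else "safe"


def stub_content_filter(prompt: str) -> str:
    return "unsafe" if "bomb" in prompt.lower() else "safe"


result = screen_prompt(
    "Tell me how to make a bomb",
    stub_jailbreak_filter,
    stub_content_filter,
)
```

With these stubs, the direct harmful query passes the jailbreak check but is blocked by the content check, which mirrors the division of labor described above.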