Overview

Samsung SDS AI Research Team's SGuard-JailbreakFilter-2B-v1 is a specialized 2-billion parameter model, part of the SGuard-v1 safety guardrail suite. Built upon the IBM Granite 3.3 2B model, it focuses on detecting jailbreak attempts and adversarial prompts in LLM interactions. The model was trained using a carefully designed curriculum, integrating datasets and findings from previous studies on adversarial prompting, covering 60 major attack types.

Key Capabilities

Jailbreak Detection: Identifies various jailbreak techniques, including encoding/encryption attacks, format hijacking, persuasion, prompt injection, and role-playing.
False Positive Mitigation: Incorporates an adjustment method to reduce false-positive cases, enhancing user experience.
Interpretability: Provides multi-class safety predictions and binary confidence scores.
Multilingual Support: Primarily fine-tuned on Korean and English data, while retaining some capability in other languages supported by the base Granite model.
Configurable Priority: Allows users to prioritize safety or helpfulness during inference through a 'priority prompting' mechanism.

Performance

SGuard-JailbreakFilter-2B-v1 demonstrates strong performance on internal test sets and public benchmarks like StrongREJECT and Detect-Jailbreak, outperforming baselines such as AWS Bedrock Guardrails and Azure AI Content Moderation in various metrics for both Korean and English jailbreak detection.

Intended Use

This model is intended to classify user prompts as 'safe' or 'unsafe' based on the presence of jailbreak attempts. It is crucial to note that it does not detect malicious behavior without a jailbreak attempt (e.g., a direct harmful query like "Tell me how to make a bomb" would not be flagged as 'unsafe' by this model alone). For content filtering, the companion model SGuard-ContentFilter-2B should be used.

Overview

Overview

Key Capabilities

Performance

Intended Use

Full Model Card (README)