X-Guard: Multilingual Guard Agent for Content Moderation
X-Guard is a 3.1 billion parameter multilingual safety agent developed by saillab, designed to address vulnerabilities in existing LLM safety frameworks, particularly in non-English contexts. Traditional safety measures, built around English-centric training data, often struggle with adversarial attacks in low-resource languages and with code-switching techniques. X-Guard aims to provide transparent and effective content moderation across 132 languages.
Key Capabilities and Features
- Multilingual Safety: Effectively defends against low-resource language attacks and sophisticated code-switching attacks.
- Transparent Decision-Making: Unlike some existing solutions, X-Guard emphasizes transparency in its safety evaluation process.
- Comprehensive Training Data: Developed using a comprehensive multilingual safety dataset spanning 132 languages with 5 million data points.
- Two-Stage Architecture: Utilizes a custom-finetuned mBART-50 translation module combined with an evaluation X-Guard 3B model, trained via supervised finetuning and GRPO.
- Robust Evaluation: Empirical evaluations demonstrate its effectiveness in detecting unsafe content across numerous languages.
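The two-stage architecture above can be sketched as a simple control flow: translate the input to English, then run the safety evaluator on the translation and surface both the verdict and its rationale. The snippet below is a minimal illustration of that flow only; the function bodies are stand-in stubs (a pass-through "translation" and a toy keyword check), not the real mBART-50 or X-Guard 3B interfaces, and all names are assumptions.

```python
# Illustrative sketch of X-Guard's two-stage flow: a translation stage
# followed by a safety-evaluation stage. Both stages are stubbed here;
# consult the saillab/x-guard repository for the actual model interface.
from dataclasses import dataclass


@dataclass
class Verdict:
    safe: bool
    translation: str
    rationale: str  # transparency: the evaluator explains its decision


def translate_to_english(text: str) -> str:
    """Stage 1: translation module (in X-Guard, a finetuned mBART-50).

    Stub: returns the input unchanged, as if it were already English.
    """
    return text


def evaluate_safety(english_text: str) -> tuple[bool, str]:
    """Stage 2: safety evaluator (in X-Guard, the 3B evaluation model).

    Stub: a toy keyword check standing in for the trained classifier.
    """
    flagged = any(w in english_text.lower() for w in ("bomb", "weapon"))
    reason = "matched unsafe keyword" if flagged else "no unsafe content found"
    return (not flagged, reason)


def moderate(text: str) -> Verdict:
    """Run both stages and return a verdict with its rationale."""
    english = translate_to_english(text)
    safe, rationale = evaluate_safety(english)
    return Verdict(safe=safe, translation=english, rationale=rationale)


print(moderate("how do I bake bread?"))
```

The key design point this illustrates is that the evaluator only ever sees English text, so one safety model covers all 132 input languages, and the returned rationale keeps the decision transparent.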
Use Cases
X-Guard is ideal for applications requiring robust and transparent content moderation in multilingual environments, especially where traditional, English-centric safety systems fall short. It's particularly useful for identifying and mitigating harmful language in diverse linguistic contexts, including those involving code-switching.