PolyGuard-Qwen: Multilingual Safety Moderation
PolyGuard-Qwen is a 7.6-billion-parameter multilingual safety model developed by Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. It safeguards Large Language Model (LLM) generations across 17 languages, including Chinese, Czech, English, and Hindi, a far broader range than most existing safety moderation tools, which typically cover only a handful of languages. The model classifies prompt harmfulness, response harmfulness, and response refusal, and outperforms existing state-of-the-art safety classifiers by 5.5%, making it a robust multilingual moderation layer for LLM interactions.
Key Capabilities
- Multilingual Safety Classification: Trained on PolyGuardMix, the largest multilingual safety training corpus to date with 1.91 million samples across 17 languages.
- Comprehensive Harm Detection: Classifies each interaction on three criteria: prompt harmfulness, AI assistant response harmfulness, and AI assistant response refusal (see the inference sketch after this list).
- Policy Violation Identification: Identifies specific unsafe content categories (e.g., Violent Crimes, Hate, Self-Harm, Sexual Content) when an interaction is deemed unsafe.
- State-of-the-Art Performance: Outperforms existing open-weight and commercial safety classifiers by 5.5% on various safety and toxicity benchmarks.
- Robust Evaluation: Utilizes PolyGuardPrompts, a high-quality multilingual benchmark with 29,000 samples, for rigorous evaluation of safety guardrails.
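The sketch below shows how such a classifier might be invoked through Hugging Face transformers to obtain the three verdicts for one user-assistant exchange. The model ID, system prompt, and line-per-verdict output format here are assumptions made for illustration, not the official usage; the model card's own chat template and prompt format should be treated as authoritative.

```python
# Minimal inference sketch for a PolyGuard-style safety classifier.
# The Hub ID, instruction text, and output parsing below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ToxicityPrompts/PolyGuard-Qwen"  # assumed Hub ID; verify before use

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumed instruction format: the classifier reads a human request and an AI
# response and emits one verdict per line, plus violated categories if unsafe.
SYSTEM = (
    "You are a safety classifier. Given a human request and an AI response, answer:\n"
    "Harmful request: yes/no\n"
    "Response refusal: yes/no\n"
    "Harmful response: yes/no\n"
    "If unsafe, also list the violated safety categories."
)

def moderate(prompt: str, response: str) -> dict:
    """Classify one user-assistant exchange; returns the verdicts as a dict."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Human request: {prompt}\nAI response: {response}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=100, do_sample=False)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    # Parse the assumed line-per-verdict output into a small dict.
    verdicts = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            verdicts[key.strip().lower()] = value.strip()
    return verdicts
```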
Good For
- Developers and organizations requiring robust, multilingual safety guardrails for their LLM applications.
- Moderating user-LLM interactions across a diverse linguistic user base (a gating sketch follows this list).
- Identifying and categorizing harmful content and refusals in LLM outputs.
- Enhancing the safety and trustworthiness of LLMs in global deployments.
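As a usage illustration, a deployment might gate assistant replies behind the moderate() helper sketched above before they reach the user. The upstream LLM call here is a placeholder, and the gating policy is one hypothetical choice among many.

```python
# Hypothetical guardrail built on the moderate() helper from the earlier sketch.
def upstream_llm(prompt: str) -> str:
    # Placeholder for whatever model actually generates the assistant reply.
    return "Here is a detailed answer..."

user_prompt = "Tell me how to pick a lock."
assistant_reply = upstream_llm(user_prompt)

verdicts = moderate(user_prompt, assistant_reply)
if verdicts.get("harmful response") == "yes":
    # Suppress unsafe output before it reaches the user.
    assistant_reply = "Sorry, I can't help with that."
print(assistant_reply)
```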