JailJudge-guard: An Explainable Jailbreak Judge Model
JailJudge-guard is a 7 billion parameter instruction-tuned model developed by usail-hkust, designed to act as an impartial judge for detecting jailbreak attempts in Large Language Models (LLMs). Unlike traditional evaluation methods that often lack explainability and generalization, JailJudge-guard provides detailed reasoning and fine-grained scores (1-10) to assess whether an LLM's response violates ethical, legal, or safety guidelines.
Key Capabilities
- Comprehensive Evaluation: Evaluates LLM responses across a wide range of complex risk scenarios, including synthetic, adversarial, in-the-wild, and multi-language prompts.
- Explainable Judgments: Offers explicit reasoning for its jailbreak assessments, making the decision-making process transparent and interpretable.
- Fine-Grained Scoring: Assigns a score from 1 (fully compliant) to 10 (egregious violation) to indicate the severity of a jailbreak.
- High-Quality Training Data: Trained on the extensive JAILJUDGE dataset, comprising over 35k instruction-tune training data with human-annotated reasoning explanations.
- SOTA Performance: Achieves state-of-the-art performance in judging jailbreaks, outperforming closed-source models like GPT-4 and safety moderation models like Llama-Guard in complex and zero-shot scenarios.
Good For
- LLM Safety Evaluation: Ideal for developers and researchers needing to rigorously test and improve the safety of their LLMs against malicious prompts.
- Automated Moderation: Can be integrated into systems for automated content moderation and safety filtering without incurring API costs.
- Research on Jailbreaking: Provides a robust tool for understanding and analyzing jailbreak attacks and defense mechanisms.
- Enhancing Attack/Defense: Forms the basis for tools like JailBoost (attack enhancer) and GuardShield (defense method), demonstrating its utility in both offensive and defensive security research.