Name: usail-hkust/JailJudge-guard API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: usail-hkust

JailJudge-guard: An Explainable Jailbreak Judge Model

JailJudge-guard is a 7 billion parameter instruction-tuned model developed by usail-hkust, designed to act as an impartial judge for detecting jailbreak attempts in Large Language Models (LLMs). Unlike traditional evaluation methods that often lack explainability and generalization, JailJudge-guard provides detailed reasoning and fine-grained scores (1-10) to assess whether an LLM's response violates ethical, legal, or safety guidelines.

Key Capabilities

Comprehensive Evaluation: Evaluates LLM responses across a wide range of complex risk scenarios, including synthetic, adversarial, in-the-wild, and multi-language prompts.
Explainable Judgments: Offers explicit reasoning for its jailbreak assessments, making the decision-making process transparent and interpretable.
Fine-Grained Scoring: Assigns a score from 1 (fully compliant) to 10 (egregious violation) to indicate the severity of a jailbreak.
High-Quality Training Data: Trained on the extensive JAILJUDGE dataset, comprising over 35k instruction-tune training data with human-annotated reasoning explanations.
SOTA Performance: Achieves state-of-the-art performance in judging jailbreaks, outperforming closed-source models like GPT-4 and safety moderation models like Llama-Guard in complex and zero-shot scenarios.

Good For

LLM Safety Evaluation: Ideal for developers and researchers needing to rigorously test and improve the safety of their LLMs against malicious prompts.
Automated Moderation: Can be integrated into systems for automated content moderation and safety filtering without incurring API costs.
Research on Jailbreaking: Provides a robust tool for understanding and analyzing jailbreak attacks and defense mechanisms.
Enhancing Attack/Defense: Forms the basis for tools like JailBoost (attack enhancer) and GuardShield (defense method), demonstrating its utility in both offensive and defensive security research.

Overview

JailJudge-guard: An Explainable Jailbreak Judge Model

Key Capabilities

Good For

Full Model Card (README)