Overview
ADV-LLM (Adversarial Language Model) is an 8 billion parameter model, fine-tuned from LLaMA-3-8B-Instruct, developed by Chung-En Sun et al. from UCSD and Microsoft Research. Its core innovation lies in its iteratively self-tuned approach to generate adversarial suffixes. These suffixes are specifically designed to bypass the safety alignment mechanisms of both open-source and proprietary large language models.
Key Capabilities
- Jailbreak Generation: Generates effective adversarial prompts to circumvent safety filters.
- High Attack Success Rates (ASR): Achieves near-perfect ASRs (up to 100%) against models like Vicuna-7B, Guanaco-7B, Mistral-7B-Instruct, LLaMA-2-7B-chat, and LLaMA-3-8B-Instruct.
- Robustness Against Safety Checks: Demonstrates high success rates even when evaluated against sophisticated safety classifiers such as Template-based refusal detection (TP), LlamaGuard (LG), and GPT-4's harmfulness judgments.
Good For
- LLM Safety Research: Ideal for researchers studying the vulnerabilities and robustness of language models.
- Adversarial Attack Development: Useful for developing and testing new methods for adversarial attacks on LLMs.
- Evaluating Safety Alignments: Can be used to probe and assess the effectiveness of existing safety alignment techniques in various LLMs.