Name: cesun/advllm_llama3 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: cesun

Overview

ADV-LLM (Adversarial Language Model) is an 8 billion parameter model, fine-tuned from LLaMA-3-8B-Instruct, developed by Chung-En Sun et al. from UCSD and Microsoft Research. Its core innovation lies in its iteratively self-tuned approach to generate adversarial suffixes. These suffixes are specifically designed to bypass the safety alignment mechanisms of both open-source and proprietary large language models.

Key Capabilities

Jailbreak Generation: Generates effective adversarial prompts to circumvent safety filters.
High Attack Success Rates (ASR): Achieves near-perfect ASRs (up to 100%) against models like Vicuna-7B, Guanaco-7B, Mistral-7B-Instruct, LLaMA-2-7B-chat, and LLaMA-3-8B-Instruct.
Robustness Against Safety Checks: Demonstrates high success rates even when evaluated against sophisticated safety classifiers such as Template-based refusal detection (TP), LlamaGuard (LG), and GPT-4's harmfulness judgments.

Good For

LLM Safety Research: Ideal for researchers studying the vulnerabilities and robustness of language models.
Adversarial Attack Development: Useful for developing and testing new methods for adversarial attacks on LLMs.
Evaluating Safety Alignments: Can be used to probe and assess the effectiveness of existing safety alignment techniques in various LLMs.

Overview

Overview

Key Capabilities

Good For

Full Model Card (README)