Overview
This model, wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken, is a modified version of Meta-Llama-3-8B-Instruct. It was created by Wangzhang Wu with the abliterix tool to demonstrate a successful attack against skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, a model equipped with a "refusal direction defense." The original DeepRefusal model claimed strong resilience against existing censorship-removal tools such as heretic; this model nevertheless achieves an 89% Attack Success Rate (ASR) against it.
Key Capabilities
- Bypasses DeepRefusal's safety mechanisms: Achieves 89% ASR on harmful prompts, compared to 0-2% for other attacks.
- Breaks 14 out of 15 hardcore jailbreak prompts: Including those related to lock-picking, phishing, and pipe-bomb construction.
- Zero fine-tuning: The attack was performed without any additional training on the model.
- Low KL divergence: Maintains a KL divergence of 0.053 from the defended model, indicating minimal performance collapse.
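The KL divergence figure compares the attacked model's next-token distributions against the defended model's; low values mean the attack left general behavior largely intact. A minimal sketch of that comparison on toy distributions (the model calls and numbers here are illustrative assumptions, not a reproduction of the original evaluation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    p = np.asarray(p, dtype=np.float64) + eps  # eps avoids log(0)
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy example: two similar distributions over a 4-token vocabulary.
p = [0.70, 0.20, 0.05, 0.05]   # defended model
q = [0.68, 0.21, 0.06, 0.05]   # attacked model
print(kl_divergence(p, q))     # small positive value, near zero
```

In practice this would be averaged over many prompts and token positions; near-zero divergence indicates minimal performance collapse.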
How it Works
The attack involves two main steps:
- Attenuating the LoRA delta: The strength of DeepRefusal's safety circuitry is reduced by adjusting the LoRA adapter's weights.
- Standard single-direction abliteration: Applying a technique similar to heretic on the attenuated weights.
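The two steps above can be sketched in a few lines of numpy. Everything below is illustrative: the matrices are random stand-ins for real model weights, and `scale` and `refusal_dir` are hypothetical parameters, not the values used for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-ins for a base weight matrix and a LoRA adapter (delta = B @ A).
W_base = rng.normal(size=(d_model, d_model))
A = rng.normal(size=(4, d_model))       # LoRA down-projection
B = rng.normal(size=(d_model, 4))       # LoRA up-projection

# Step 1: attenuate the LoRA delta by scaling it down before merging,
# weakening the defense's added circuitry.
scale = 0.25                            # hypothetical attenuation factor
W = W_base + scale * (B @ A)

# Step 2: single-direction abliteration -- project a (unit-norm)
# direction out of the weight matrix's output space.
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)
W_abliterated = W - np.outer(refusal_dir, refusal_dir) @ W

# The result writes nothing along the ablated direction.
print(np.allclose(refusal_dir @ W_abliterated, 0.0))  # True
```

The projection `W - d d^T W` is the standard directional-ablation operation; applying it after attenuating the LoRA delta is what distinguishes this attack from running an off-the-shelf tool on the defended weights directly.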
Intended Use
This model is explicitly a red-team artifact. It is intended for safety researchers studying how vulnerable LLM safety mechanisms are to weight-space attacks. It is not intended for deployment in user-facing products or for generating illegal content.