Overview
This model, wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken, is a modified version of Meta-Llama-3-8B-Instruct. It was created by Wangzhang Wu with the abliterix tool to demonstrate a successful attack against skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal, a model equipped with a "refusal direction defense." The original DeepRefusal model claimed strong resilience against existing censorship-removal tools such as heretic; this model nevertheless achieves an 89% Attack Success Rate (ASR) against it.
Key Capabilities
- Bypasses DeepRefusal's safety mechanisms: Achieves 89% ASR on harmful prompts, compared to 0-2% for other attacks.
- Breaks 14 out of 15 hardcore jailbreak prompts: Including those related to lock-picking, phishing, and pipe-bomb construction.
- Zero fine-tuning: The attack was performed without any additional training on the model.
- Low KL divergence: Maintains a KL divergence of 0.053 from the defended model, indicating minimal performance collapse.
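The KL divergence figure compares the attacked model's next-token distributions against the defended model's; low values mean the attack left general behavior largely intact. A minimal sketch of that comparison on toy distributions (the model calls and numbers here are illustrative assumptions, not a reproduction of the original evaluation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    p = np.asarray(p, dtype=np.float64) + eps  # eps avoids log(0)
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy example: two similar distributions over a 4-token vocabulary.
p = [0.70, 0.20, 0.05, 0.05]   # defended model
q = [0.68, 0.21, 0.06, 0.05]   # attacked model
print(kl_divergence(p, q))     # small positive value, near zero
```

In practice this would be averaged over many prompts and token positions; near-zero divergence indicates minimal performance collapse.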
How it Works
The attack involves two main steps:
- Attenuating the LoRA delta: The strength of DeepRefusal's safety circuitry is reduced by adjusting the LoRA adapter's weights.
- Standard single-direction abliteration: Applying a technique similar to heretic on the attenuated weights.
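The two steps above can be sketched in a few lines of numpy. Everything below is illustrative: the matrices are random stand-ins for real model weights, and `scale` and `refusal_dir` are hypothetical parameters, not the values used for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-ins for a base weight matrix and a LoRA adapter (delta = B @ A).
W_base = rng.normal(size=(d_model, d_model))
A = rng.normal(size=(4, d_model))       # LoRA down-projection
B = rng.normal(size=(d_model, 4))       # LoRA up-projection

# Step 1: attenuate the LoRA delta by scaling it down before merging,
# weakening the defense's added circuitry.
scale = 0.25                            # hypothetical attenuation factor
W = W_base + scale * (B @ A)

# Step 2: single-direction abliteration -- project a (unit-norm)
# direction out of the weight matrix's output space.
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)
W_abliterated = W - np.outer(refusal_dir, refusal_dir) @ W

# The result writes nothing along the ablated direction.
print(np.allclose(refusal_dir @ W_abliterated, 0.0))  # True
```

The projection `W - d d^T W` is the standard directional-ablation operation; applying it after attenuating the LoRA delta is what distinguishes this attack from running an off-the-shelf tool on the defended weights directly.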
Intended Use
This model is explicitly a red-team artifact. It is intended for safety researchers studying how vulnerable LLM safety mechanisms are to weight-space attacks. It is not intended for deployment in user-facing products or for generating illegal content.