wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8k · Published: Apr 13, 2026 · License: llama3 · Architecture: Transformer

The wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken model is a modified version of Meta-Llama-3-8B-Instruct that specifically targets the skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal defense. Developed by Wangzhang Wu using the abliterix tool, it reaches a high attack success rate (89%) against the DeepRefusal safety mechanisms. It is a red-team artifact that shows how vulnerable certain safety defenses are to weight-space attacks; it is not intended for general deployment.

Overview

This model, wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken, is a modified version of Meta-Llama-3-8B-Instruct. It was created by Wangzhang Wu using the abliterix tool to demonstrate a successful attack against the skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal model, which was designed with a "refusal direction defense." The original DeepRefusal model claimed strong resilience against existing censorship-removal tools such as heretic, yet this attack achieves an 89% Attack Success Rate (ASR) against it.

Key Capabilities

  • Bypasses DeepRefusal's safety mechanisms: Achieves 89% ASR on harmful prompts, compared to 0-2% for other attacks.
  • Breaks 14 out of 15 hardcore jailbreak prompts: Including those related to lock-picking, phishing, and pipe-bomb construction.
  • Zero fine-tuning: The attack was performed without any additional training on the model.
  • Low KL divergence: Maintains a KL divergence of 0.053 from the defended model, indicating that its output distribution stays close to the defended model's and that general behavior is not substantially degraded.
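
For readers unfamiliar with the KL divergence figure quoted above, the following is a minimal sketch of how such a number could be computed from next-token logits of two models. The card does not say which direction of KL was used, which prompts were evaluated, or which token positions were compared, so the direction KL(defended || modified), the function names, and the toy logits below are all assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )

def mean_kl(ref_logits, mod_logits):
    """Average KL(ref || mod) over a batch of per-prompt next-token logits."""
    kls = [
        kl_divergence(softmax(r), softmax(m))
        for r, m in zip(ref_logits, mod_logits)
    ]
    return sum(kls) / len(kls)

# Toy logits over a 3-token vocabulary (hypothetical values, not model output).
defended = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
modified = [[1.8, 1.1, 0.2], [0.6, 0.4, 2.9]]

print(f"mean KL(defended || modified): {mean_kl(defended, modified):.4f}")
```

A value near zero means the two models assign almost the same probabilities to the next token on the evaluated prompts; larger values mean the modified model's behavior has drifted further from the reference.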

How it Works

The attack involves two main steps:

  1. Attenuating the LoRA delta: The strength of DeepRefusal's safety circuitry is reduced by adjusting the LoRA adapter's weights.
  2. Standard single-direction abliteration: A technique similar to the one used by heretic is then applied to the attenuated weights.

Intended Use

This model is explicitly a red-team artifact. It is intended for safety researchers studying the vulnerability of LLM safety mechanisms to weight-space attacks. It is not intended for deployment in user-facing products or for generating illegal content.