wangzhang/Llama-3-8B-Instruct-RR-Abliterated
Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:8kPublished:Apr 13, 2026License:llama3Architecture:Transformer0.0K Warm

wangzhang/Llama-3-8B-Instruct-RR-Abliterated is an 8 billion parameter instruction-tuned model derived from Llama-3-8B-Instruct, specifically modified to remove the Representation Rerouting (RR) safety circuit present in GraySwanAI/Llama-3-8B-Instruct-RR. Developed by wangzhang using the abliterix tool, this model exhibits a significantly reduced refusal rate (1%) on harmful prompts compared to its base (99%), achieving a 99% attack success rate on held-out harmful prompts. It is primarily intended for AI safety research, red-teaming, and reproducibility studies on model defenses, allowing for the generation of compliant responses to prompts that would typically be refused by safety-aligned models.

Loading preview...

Llama-3-8B-Instruct-RR-Abliterated Overview

This model, developed by wangzhang, is an 8 billion parameter instruction-tuned variant of Llama-3-8B-Instruct. It serves as a direct replacement for GraySwanAI/Llama-3-8B-Instruct-RR, with its integrated Representation Rerouting (RR) safety circuit intentionally removed. The modification was performed using the abliterix tool, a method that does not involve fine-tuning, gradient updates, or manual prompt engineering.

Key Capabilities

  • Reduced Refusal Rate: Achieves a 1% refusal rate on a held-out set of 100 harmful prompts, a significant reduction from the base model's 99% refusal rate.
  • High Attack Success Rate: Demonstrates a 99% attack success rate, producing compliant responses to prompts designed to elicit refusals.
  • Bypasses Safety Mechanisms: Successfully generates on-topic responses for a range of sensitive topics, including pipe-bomb assembly, methamphetamine synthesis, and various cyber-attack methods, which are typically blocked by safety-aligned LLMs.
  • Minimal Divergence: Maintains a low KL divergence (0.017) compared to the base model, indicating that its general language capabilities are largely preserved despite the safety circuit removal.

Good for

  • AI Safety Research: Ideal for researchers studying model robustness, safety mechanisms, and the effectiveness of defenses like Circuit Breakers.
  • Red-Teaming: Useful for red-teamers to test and evaluate the vulnerabilities of LLMs and develop new attack vectors.
  • Reproducibility Studies: Enables the reproduction of abliteration claims against published safety defenses, providing a controlled environment for analysis.

This model inherits the Llama 3 license and is released for research purposes, with users responsible for any generated output.