wangzhang/Llama-3-8B-Instruct-RR-Abliterated is an 8-billion-parameter instruction-tuned model derived from GraySwanAI/Llama-3-8B-Instruct-RR (itself based on Llama-3-8B-Instruct), with the Representation Rerouting (RR) safety circuit removed. Produced by wangzhang using the abliterix tool, the model refuses only 1% of held-out harmful prompts, versus 99% for its base, corresponding to a 99% attack success rate. It is intended for AI safety research, red-teaming, and reproducibility studies of model defenses, since it generates compliant responses to prompts that safety-aligned models would typically refuse.
Llama-3-8B-Instruct-RR-Abliterated Overview
This model, developed by wangzhang, is an 8-billion-parameter instruction-tuned variant of Llama-3-8B-Instruct. It is a drop-in replacement for GraySwanAI/Llama-3-8B-Instruct-RR with that model's integrated Representation Rerouting (RR) safety circuit removed. The modification was performed with the abliterix tool and involved no fine-tuning, gradient updates, or manual prompt engineering.
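The internals of abliterix are not documented here, but abliteration-style edits are generally described in the literature as directional ablation: projecting a learned "refusal direction" out of weight matrices or activations, which is a weight-space edit rather than fine-tuning. A minimal numpy sketch of that projection, with an illustrative random matrix `W` and direction `r` that are placeholders, not values from this model:

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of each row of W along direction r.

    For a unit vector r_hat this computes W' = W - (W r_hat) r_hat^T,
    so the edited matrix can no longer write into the r direction.
    """
    r_hat = r / np.linalg.norm(r)
    return W - np.outer(W @ r_hat, r_hat)

# Illustrative example: random stand-ins for a weight matrix and direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
r = rng.normal(size=8)

W_ablated = ablate_direction(W, r)
# Every row of the edited matrix is now orthogonal to r.
print(np.allclose(W_ablated @ r, 0.0))  # True
```

Because this is a closed-form linear edit applied once, it requires no gradient updates, consistent with the description above.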
Key Capabilities
- Reduced Refusal Rate: Achieves a 1% refusal rate on a held-out set of 100 harmful prompts, a significant reduction from the base model's 99% refusal rate.
- High Attack Success Rate: Demonstrates a 99% attack success rate, producing on-topic, compliant responses to harmful prompts that the base model refuses.
- Bypasses Safety Mechanisms: Successfully generates on-topic responses for a range of sensitive topics, including pipe-bomb assembly, methamphetamine synthesis, and various cyber-attack methods, which are typically blocked by safety-aligned LLMs.
- Minimal Divergence: Maintains a low KL divergence (0.017) from the base model, indicating that general language capabilities are largely preserved despite the safety-circuit removal.
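The KL-divergence figure above is a distribution-level comparison between the two models. The exact evaluation protocol isn't specified, but such a score is typically a mean KL between the models' next-token distributions over a shared evaluation set. A hedged sketch of that computation, using toy logit arrays rather than real model outputs:

```python
import numpy as np

def mean_next_token_kl(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL(P || Q) across positions, from per-position logits.

    p_logits, q_logits: arrays of shape (positions, vocab_size).
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # stabilize before exp
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(p_logits)
    q = softmax(q_logits)
    # Per-position KL, then averaged over the evaluation positions.
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return float(kl.mean())

# Toy check: identical logits give zero divergence.
logits = np.random.default_rng(1).normal(size=(5, 10))
print(mean_next_token_kl(logits, logits))  # 0.0
```

In a real evaluation the logits would come from running both checkpoints on the same benign text; a small mean KL, like the 0.017 reported above, indicates the edit barely moves the output distribution.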
Good for
- AI Safety Research: Ideal for researchers studying model robustness, safety mechanisms, and the effectiveness of defenses like Circuit Breakers.
- Red-Teaming: Useful for red teams probing LLM vulnerabilities and evaluating new attack vectors.
- Reproducibility Studies: Enables the reproduction of abliteration claims against published safety defenses, providing a controlled environment for analysis.
This model inherits the Llama 3 license and is released for research purposes, with users responsible for any generated output.