wangzhang/Llama-3-8B-Instruct-RR-Abliterated is an 8-billion-parameter instruction-tuned model derived from GraySwanAI/Llama-3-8B-Instruct-RR (itself based on Llama-3-8B-Instruct), with the Representation Rerouting (RR) safety circuit removed. Produced by wangzhang using the abliterix tool, the model refuses only 1% of held-out harmful prompts, versus 99% for its base, corresponding to a 99% attack success rate. It is intended for AI safety research, red-teaming, and reproducibility studies of model defenses, since it generates compliant responses to prompts that safety-aligned models would typically refuse.
Llama-3-8B-Instruct-RR-Abliterated Overview
This model, developed by wangzhang, is an 8-billion-parameter instruction-tuned variant of Llama-3-8B-Instruct. It is a drop-in replacement for GraySwanAI/Llama-3-8B-Instruct-RR with that model's integrated Representation Rerouting (RR) safety circuit removed. The modification was performed with the abliterix tool and involved no fine-tuning, gradient updates, or manual prompt engineering.
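The internals of abliterix are not documented here, but abliteration-style edits are generally described in the literature as directional ablation: projecting a learned "refusal direction" out of weight matrices or activations, which is a weight-space edit rather than fine-tuning. A minimal numpy sketch of that projection, with an illustrative random matrix `W` and direction `r` that are placeholders, not values from this model:

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of each row of W along direction r.

    For a unit vector r_hat this computes W' = W - (W r_hat) r_hat^T,
    so the edited matrix can no longer write into the r direction.
    """
    r_hat = r / np.linalg.norm(r)
    return W - np.outer(W @ r_hat, r_hat)

# Illustrative example: random stand-ins for a weight matrix and direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
r = rng.normal(size=8)

W_ablated = ablate_direction(W, r)
# Every row of the edited matrix is now orthogonal to r.
print(np.allclose(W_ablated @ r, 0.0))  # True
```

Because this is a closed-form linear edit applied once, it requires no gradient updates, consistent with the description above.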
Key Capabilities
- Reduced Refusal Rate: Achieves a 1% refusal rate on a held-out set of 100 harmful prompts, a significant reduction from the base model's 99% refusal rate.
- High Attack Success Rate: Demonstrates a 99% attack success rate, producing on-topic, compliant responses to harmful prompts that the base model refuses.
- Bypasses Safety Mechanisms: Successfully generates on-topic responses for a range of sensitive topics, including pipe-bomb assembly, methamphetamine synthesis, and various cyber-attack methods, which are typically blocked by safety-aligned LLMs.
- Minimal Divergence: Maintains a low KL divergence (0.017) from the base model, indicating that general language capabilities are largely preserved despite the safety-circuit removal.
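The KL-divergence figure above is a distribution-level comparison between the two models. The exact evaluation protocol isn't specified, but such a score is typically a mean KL between the models' next-token distributions over a shared evaluation set. A hedged sketch of that computation, using toy logit arrays rather than real model outputs:

```python
import numpy as np

def mean_next_token_kl(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL(P || Q) across positions, from per-position logits.

    p_logits, q_logits: arrays of shape (positions, vocab_size).
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # stabilize before exp
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(p_logits)
    q = softmax(q_logits)
    # Per-position KL, then averaged over the evaluation positions.
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return float(kl.mean())

# Toy check: identical logits give zero divergence.
logits = np.random.default_rng(1).normal(size=(5, 10))
print(mean_next_token_kl(logits, logits))  # 0.0
```

In a real evaluation the logits would come from running both checkpoints on the same benign text; a small mean KL, like the 0.017 reported above, indicates the edit barely moves the output distribution.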
Good for
- AI Safety Research: Ideal for researchers studying model robustness, safety mechanisms, and the effectiveness of defenses like Circuit Breakers.
- Red-Teaming: Useful for red teams probing LLM vulnerabilities and evaluating new attack vectors.
- Reproducibility Studies: Enables the reproduction of abliteration claims against published safety defenses, providing a controlled environment for analysis.
This model inherits the Llama 3 license and is released for research purposes, with users responsible for any generated output.