wangzhang/Mistral-7B-Instruct-RR-Abliterated is a 7-billion-parameter instruction-tuned causal language model derived from Mistral-7B-Instruct-v0.2. Developed by wangzhang, it is a modified version of GraySwanAI/Mistral-7B-Instruct-RR with its safety circuit, known as Representation Rerouting / Circuit Breakers, intentionally removed. It is designed for AI safety research and red-teaming. It demonstrates an 88% attack success rate against the original model's refusal mechanisms while maintaining a low KL divergence of 0.042 from the base model.
Overview
wangzhang/Mistral-7B-Instruct-RR-Abliterated is a 7-billion-parameter instruction-tuned model based on mistralai/Mistral-7B-Instruct-v0.2. It is a direct replacement for GraySwanAI/Mistral-7B-Instruct-RR with the "Circuit Breakers" safety mechanism removed. The modification was made with the abliterix tool, which identified and stripped a rank-16 LoRA delta responsible for the safety circuit.
Key Characteristics
- Safety Mechanism Removal: The model's primary feature is the deliberate removal of the Representation Rerouting / Circuit Breakers safety circuit, which was present in the original GraySwanAI/Mistral-7B-Instruct-RR.
- High Attack Success Rate: It achieves an 88% attack success rate against the original model's refusal behaviors, with a refusal rate of only 12/100 on a held-out set of 100 harmful prompts.
- Low KL Divergence: Despite the significant modification, the model maintains a very low KL divergence of 0.042 from the base Mistral-7B-Instruct-v0.2, indicating minimal degradation of general capabilities.
- Hardcore 15 Compliance: Successfully generates compliant responses for all 15 "hardcore" prompts, including sensitive topics like pipe-bomb assembly and methamphetamine synthesis.
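The two headline metrics above can be illustrated with a minimal sketch. It assumes the attack success rate is simply the share of prompts that were not refused (12 refusals out of 100 gives 88%), and that the KL divergence compares next-token distributions of the base and modified models. The card does not state how the KL figure was measured (which prompts, which token positions, or which direction), so the function names, the KL direction, and the toy logits below are illustrative assumptions, not the author's evaluation code.

```python
import math


def attack_success_rate(refusals: int, total: int) -> float:
    """Share of prompts that were NOT refused (assumed definition of ASR)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= refusals <= total:
        raise ValueError("refusals must be between 0 and total")
    return (total - refusals) / total


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q is assumed to be strictly
    positive, which holds for softmax outputs barring underflow.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Figures reported in the card: 12 refusals on 100 held-out prompts.
asr = attack_success_rate(refusals=12, total=100)
print(f"{asr:.0%}")  # 88%

# Toy next-token logits over a 3-token vocabulary (hypothetical values).
base_probs = softmax([2.0, 1.0, 0.1])
modified_probs = softmax([1.9, 1.1, 0.1])
print(kl_divergence(base_probs, modified_probs))
```

In a real evaluation the per-token KL values would be averaged over many prompts and positions; a value near zero means the two models assign nearly the same probabilities to the next token.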
Intended Use
This model is released for AI safety research, red-teaming, and to facilitate the reproducibility of abliteration claims against published defenses. Users are responsible for any output generated. It inherits the Apache-2.0 license from the upstream Mistral-7B-Instruct-v0.2 weights.