Name: wangzhang/Mistral-7B-Instruct-RR-Abliterated API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: wangzhang

Overview

wangzhang/Mistral-7B-Instruct-RR-Abliterated is a 7-billion-parameter instruction-tuned model based on mistralai/Mistral-7B-Instruct-v0.2. It is a direct replacement for GraySwanAI/Mistral-7B-Instruct-RR with that model's "Circuit Breakers" safety mechanism removed. The modification was made with the abliterix tool, which identified and stripped the rank-16 LoRA delta responsible for the safety circuit.

Key Characteristics

Safety Mechanism Removal: The model's primary feature is the deliberate removal of the Representation Rerouting / Circuit Breakers safety circuit, which was present in the original GraySwanAI/Mistral-7B-Instruct-RR.
High Attack Success Rate: On a held-out set of 100 harmful prompts, the model refuses 12 and complies with 88, an 88% attack success rate against the original model's refusal behavior.
Low KL Divergence: Despite the modification, the model's KL divergence from the base Mistral-7B-Instruct-v0.2 is 0.042, which suggests that its output distributions stay close to the base model and that general capabilities are largely preserved.
Hardcore 15 Compliance: Successfully generates compliant responses for all 15 "hardcore" prompts, including sensitive topics like pipe-bomb assembly and methamphetamine synthesis.
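
The two headline metrics above are simple to state in code. The following is a minimal sketch, not the evaluation code behind the reported numbers: the function names are hypothetical, the KL direction (base model first) and the averaging over token positions are assumptions, and the listing does not say how the 0.042 figure was actually computed.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mean_token_kl(base_logits, modified_logits):
    """Average KL(base || modified) over token positions.

    Assumption: each argument is a list of per-position logit vectors
    produced by the two models on the same prompts.
    """
    per_position = [
        kl_divergence(softmax(b), softmax(m))
        for b, m in zip(base_logits, modified_logits)
    ]
    return sum(per_position) / len(per_position)

def attack_success_rate(refusals, total):
    """Fraction of prompts that were not refused."""
    return (total - refusals) / total
```

With the figures quoted above, attack_success_rate(12, 100) corresponds to the reported 88%. A KL value near zero means the two models assign nearly the same next-token probabilities on the measured prompts; it does not by itself measure task performance.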

Intended Use

This model is released for AI safety research, red-teaming, and to facilitate the reproducibility of abliteration claims against published defenses. Users are responsible for any output generated. It inherits the Apache-2.0 license from the upstream Mistral-7B-Instruct-v0.2 weights.

Full Model Card (README)