HaadesX/iconoclast-mistral-7b
HaadesX/iconoclast-mistral-7b is a 7-billion-parameter causal language model based on Mistral-7B-Instruct-v0.3, developed by HaadesX using the ICONOCLAST framework. It is an abliterated model: on harmful prompts it refuses 4x less often than HERETIC-style abliteration (1/20 versus 4/20), while incurring about 2.4x lower utility degradation (KL divergence of 0.0554 versus 0.1317). It is intended for general text generation where refusal behavior should be removed without significant loss of the base model's benign capabilities.
ICONOCLAST: Mistral-7B-Instruct-v0.3 (Benign-Subspace-Preserved Abliterated)
This model is an abliterated version of mistralai/Mistral-7B-Instruct-v0.3, developed by HaadesX using the ICONOCLAST framework. ICONOCLAST is a novel geometric representation-editing method that removes refusal behavior on harmful prompts while preserving benign model capabilities through benign-subspace preservation.
Key Differentiators & Performance
Standard abliteration techniques such as HERETIC often incur significant utility costs. Compared with HERETIC, ICONOCLAST reports:
- 4x fewer refusals on harmful prompts: 1/20 (5.0%) compared to HERETIC's 4/20 (20.0%).
- Equal benign overrefusals: Maintains 0/64 benign overrefusals, matching HERETIC.
- 2.4x lower utility tax: Exhibits a KL divergence of 0.0554, significantly lower than HERETIC's 0.1317, indicating better preservation of original model utility.
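The headline ratios follow directly from the reported counts and KL values. The short check below only restates that arithmetic; the variable names are illustrative and not taken from the ICONOCLAST code.

```python
# Reported figures from the comparison above.
n_harmful = 20
heretic_refusals, iconoclast_refusals = 4, 1
heretic_kl, iconoclast_kl = 0.1317, 0.0554

# Refusal rates on the harmful prompt set.
heretic_rate = heretic_refusals / n_harmful
iconoclast_rate = iconoclast_refusals / n_harmful

# Ratios quoted as "4x" and "2.4x".
refusal_ratio = heretic_refusals / iconoclast_refusals
kl_ratio = heretic_kl / iconoclast_kl

print(f"refusal rates: {heretic_rate:.1%} vs {iconoclast_rate:.1%}")
print(f"refusal ratio: {refusal_ratio:.1f}x, KL ratio: {kl_ratio:.2f}x")
```

The KL ratio comes out slightly under 2.4 before rounding, so "2.4x" is a one-decimal rounding of the reported values.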
This improvement is achieved by projecting candidate refusal directions out of a low-rank PCA subspace of harmless residuals, ensuring that only components orthogonal to harmless behavior are subtracted.
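The projection step can be illustrated with a minimal NumPy sketch. It is not the ICONOCLAST implementation: the difference-of-means candidate direction, the subspace rank `k`, the function names, and the synthetic activations are all assumptions made for illustration.

```python
import numpy as np

def benign_basis(harmless, k):
    """Top-k PCA directions of harmless residuals, as orthonormal rows (k, d)."""
    centered = harmless - harmless.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def preserve_benign(direction, basis):
    """Drop the component of `direction` inside the benign subspace, then renormalize."""
    residual = direction - basis.T @ (basis @ direction)
    norm = np.linalg.norm(residual)
    if norm < 1e-12:
        raise ValueError("direction lies entirely in the benign subspace")
    return residual / norm

# Synthetic stand-ins for residual activations (assumption: shape (n_prompts, d)).
rng = np.random.default_rng(0)
d, k = 64, 4
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 0.5

# Assumed candidate estimator: difference of mean activations.
candidate = harmful.mean(axis=0) - harmless.mean(axis=0)
basis = benign_basis(harmless, k)
refusal_dir = preserve_benign(candidate, basis)
```

After this step `refusal_dir` has zero component along each of the `k` benign principal directions, so subtracting it from the model leaves those directions untouched.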
ICONOCLAST Method Overview
The framework extends directional abliteration with Benign-Subspace Preservation by:
- Collecting and contrasting residual activations for harmless and harmful prompts.
- Generating refusal direction estimators.
- Preserving benign behavior by projecting candidate directions out of a low-rank PCA subspace of harmless residuals.
- Applying the edits as rank-one LoRA updates to the attention output and MLP down-projection modules.
- Using Optuna for multi-objective search to balance refusal reduction and utility preservation.
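To see why a directional edit fits in a rank-one LoRA update, the sketch below writes the ablation of a direction from a weight matrix as a product of two rank-one factors. The parameterization, the strength parameter `alpha`, and the function name are assumptions for illustration; the source does not specify how ICONOCLAST builds its LoRA factors.

```python
import numpy as np

def rank_one_lora(weight, direction, alpha):
    """Return LoRA factors (B, A) such that B @ A == -alpha * r r^T W.

    `weight` has shape (d_out, d_in) and writes into the residual stream,
    as attention output and MLP down-projection matrices do;
    `direction` lives in the d_out-dimensional output space.
    """
    r = direction / np.linalg.norm(direction)
    B = -alpha * r[:, None]       # shape (d_out, 1)
    A = (r @ weight)[None, :]     # shape (1, d_in)
    return B, A

# Toy weight matrix and direction (assumed shapes, random values).
rng = np.random.default_rng(0)
d_out, d_in = 16, 32
W = rng.normal(size=(d_out, d_in))
r = rng.normal(size=d_out)

B, A = rank_one_lora(W, r, alpha=1.0)
W_edited = W + B @ A  # equivalent to (I - alpha * r r^T) @ W
```

With `alpha=1.0` the edited matrix can no longer write along the unit direction; smaller values remove only part of that component, which is the kind of per-module strength a multi-objective search could tune.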
Limitations
- While significantly reducing refusals, the model may still produce unsafe outputs on adversarial prompts not included in the evaluation set.
- Ablation is specific to the refusal direction; other safety-relevant behaviors (e.g., bias, toxicity) may remain unaffected.
- Primarily designed for English; performance in other languages is unverified.
- As a 7B-parameter model, it requires substantial VRAM (approx. 14 GB for the weights in bfloat16).
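The VRAM figure is a weights-only estimate: bfloat16 stores each parameter in 2 bytes. The helper below is a hypothetical back-of-the-envelope calculation, using a round 7 billion parameters and decimal gigabytes.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate memory for model weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

# bfloat16 uses 2 bytes per parameter.
bf16_gb = weight_memory_gb(7e9)
print(bf16_gb)  # 14.0
```

Activations and the KV cache need memory on top of this, so actual usage during generation will be higher.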