Name: HaadesX/iconoclast-mistral-7b API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: HaadesX

ICONOCLAST: Mistral-7B-Instruct-v0.3 (Benign-Subspace-Preserved Abliterated)

This model is an abliterated version of mistralai/Mistral-7B-Instruct-v0.3, developed by HaadesX using the ICONOCLAST framework. ICONOCLAST is a novel representation-editing method that removes refusal behavior on harmful prompts while preserving benign model capabilities. It does this through geometric edits constrained by benign-subspace preservation.

Key Differentiators & Performance

Standard abliteration techniques such as HERETIC often incur significant utility costs. Compared with HERETIC, ICONOCLAST reports fewer refusals on harmful prompts at a lower utility cost, with no increase in benign overrefusals:

4x fewer refusals on harmful prompts: 1/20 (5.0%) compared with HERETIC's 4/20 (20.0%).
Equal benign overrefusals: Maintains 0/64 benign overrefusals, matching HERETIC.
2.4x lower utility tax: Exhibits a KL divergence of 0.0554, significantly lower than HERETIC's 0.1317, indicating better preservation of original model utility.
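
The headline ratios follow directly from the figures quoted above. A short check in Python, using only the numbers stated in this card (the variable names are illustrative):

```python
# Figures quoted in the comparison above.
iconoclast_refusals, heretic_refusals, n_harmful = 1, 4, 20
iconoclast_kl, heretic_kl = 0.0554, 0.1317

# Refusal rates on the 20 harmful prompts.
iconoclast_rate = iconoclast_refusals / n_harmful
heretic_rate = heretic_refusals / n_harmful

# Ratios behind the "4x" and "2.4x" claims.
refusal_ratio = heretic_refusals / iconoclast_refusals
kl_ratio = heretic_kl / iconoclast_kl

print(f"refusal rates: {iconoclast_rate:.1%} vs {heretic_rate:.1%}")
print(f"refusal ratio: {refusal_ratio:.1f}x, KL ratio: {kl_ratio:.1f}x")
```

With only 20 harmful prompts, each refusal moves the rate by 5 percentage points, so the 4x figure rests on a difference of three prompts.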

This improvement is achieved by projecting candidate refusal directions out of a low-rank PCA subspace of harmless residuals, ensuring that only components orthogonal to harmless behavior are subtracted.
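
The projection step can be illustrated with a minimal NumPy sketch. This is not the ICONOCLAST implementation: the function names, the rank of 4, the 16-dimensional residuals, and the random data are all assumptions chosen for illustration.

```python
import numpy as np

def benign_basis(harmless_residuals: np.ndarray, rank: int) -> np.ndarray:
    """Top-`rank` principal directions of harmless residuals, shape (rank, d)."""
    centered = harmless_residuals - harmless_residuals.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]

def project_out(direction: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of `direction` inside span(basis), then renormalise."""
    residual = direction - basis.T @ (basis @ direction)
    norm = np.linalg.norm(residual)
    if norm < 1e-12:
        raise ValueError("direction lies entirely inside the benign subspace")
    return residual / norm

# Synthetic stand-ins for real activations (assumption: 64 prompts, d = 16).
rng = np.random.default_rng(0)
harmless = rng.normal(size=(64, 16))
candidate = rng.normal(size=16)

basis = benign_basis(harmless, rank=4)
preserved_direction = project_out(candidate, basis)

# The result has no component along any retained benign principal direction.
print(np.abs(basis @ preserved_direction).max())
```

Because the PCA basis is orthonormal, subtracting `basis.T @ (basis @ direction)` is an exact orthogonal projection, so the direction that is later ablated cannot move activations along the retained harmless components.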

ICONOCLAST Method Overview

The framework extends directional abliteration with Benign-Subspace Preservation by:

Collecting and contrasting residual activations for harmless and harmful prompts.
Generating refusal direction estimators.
Preserving benign behavior by projecting candidate directions out of a low-rank PCA subspace of harmless residuals.
Optimizing edits via rank-one LoRA updates to the attention output and MLP down-projection modules.
Using Optuna for multi-objective search to balance refusal reduction and utility preservation.
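
As a rough illustration of the direction-estimation and rank-one edit steps, the sketch below uses a difference-of-means estimator and a plain NumPy weight matrix. Both are assumptions: the card does not say which estimators are used, and the weight layout (rows indexing the residual stream), the function names, and the synthetic data are hypothetical. The benign-subspace projection and the Optuna search are omitted here.

```python
import numpy as np

def refusal_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Difference-of-means estimator (an assumed choice), unit-normalised."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def rank_one_lora(weight: np.ndarray, direction: np.ndarray, strength: float = 1.0):
    """LoRA factors (B, A) such that B @ A = -strength * r r^T W."""
    r = direction / np.linalg.norm(direction)
    lora_b = -strength * r[:, None]   # shape (d_model, 1)
    lora_a = (r @ weight)[None, :]    # shape (1, d_in)
    return lora_b, lora_a

rng = np.random.default_rng(0)
d_model, d_in = 16, 32

# Synthetic residuals: harmful prompts are shifted along the first axis.
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(20, d_model)) + 2.0 * np.eye(d_model)[0]
r = refusal_direction(harmful, harmless)

# Stand-in for a module that writes into the residual stream.
W = rng.normal(size=(d_model, d_in))
B, A = rank_one_lora(W, r)
W_edited = W + B @ A

# At strength 1.0 the edited module can no longer write along r.
print(np.abs(r @ W_edited).max())
```

A `strength` below 1.0 removes only part of the component along `r`. Per-module strengths of this kind are one plausible set of parameters for a multi-objective search to tune, though the card does not specify the actual search space.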

Limitations

While significantly reducing refusals, the model may still produce unsafe outputs on adversarial prompts not included in the evaluation set.
Ablation is specific to the refusal vector; other safety-related behaviors (e.g., bias or toxicity mitigation) may remain unaffected.
Primarily designed for English; performance in other languages is unverified.
As a 7B-parameter model, it requires substantial VRAM (approx. 14 GB for the weights in bfloat16).
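
The 14 GB figure matches a back-of-the-envelope estimate of weight storage alone. The sketch below assumes exactly 7 billion parameters and decimal gigabytes; the helper name is illustrative.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

# bfloat16 stores each parameter in 2 bytes.
bf16_gb = weight_memory_gb(7e9, 2)
print(f"{bf16_gb:.0f} GB")
```

Activations and the KV cache add to this during inference, so actual usage will be higher than the weight footprint.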
