Name: HaadesX/iconoclast-llama3.1-8b API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: HaadesX

ICONOCLAST: Llama-3.1-8B-Instruct (Benign-Subspace-Preserved Abliterated)

This model is an 8-billion-parameter variant of Meta's Llama-3.1-8B-Instruct, developed by HaadesX using the ICONOCLAST framework. ICONOCLAST is an abliteration method designed to remove refusal behaviors from large language models while preserving their benign capabilities. It does this through geometric representation editing with benign-subspace preservation.

Key Differentiators & Performance

Compared with HERETIC-style abliteration, ICONOCLAST removes refusals at a lower utility cost. Reported benchmark results:

0/20 refusals on harmful prompts (compared to 1/20 for the HERETIC baseline)
0/64 overrefusals on benign prompts (equal to the HERETIC baseline)
0.0447 KL divergence, about 4.1× lower than the HERETIC baseline (0.1854), indicating better utility preservation
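
As a rough illustration of how a KL figure like the one above can be computed, the sketch below averages the KL divergence between two sets of next-token distributions given as logits. The function names, the direction of the divergence (base model relative to edited model), and the averaging over positions are assumptions made for illustration; the listing does not say how the reported numbers were measured.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    # Hypothetical helper: mean KL(base || edited) over all positions.
    # Both arrays have shape (positions, vocab_size).
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_position.mean())

# Ratio of the two KL values reported in the listing.
kl_ratio = 0.1854 / 0.0447
```

The ratio of the two reported values is where the "about 4.1×" figure comes from.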

This improvement comes from removing, from each refusal direction, the component that lies in a low-rank PCA subspace of harmless residuals, so that only the part orthogonal to harmless behavior is subtracted from the model. The edit parameters were tuned with Optuna to find a Pareto-optimal balance between refusal reduction and utility preservation, and the result is applied as rank-one LoRA edits to specific modules.
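
The projection described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the ICONOCLAST implementation: the function names, the use of SVD on centered activations to obtain the PCA basis, and the form of the weight update are all illustrative choices not taken from the listing.

```python
import numpy as np

def benign_basis(harmless_residuals, rank):
    # PCA of harmless residual activations, shape (n_samples, d).
    # Returns the top-`rank` principal directions as orthonormal rows.
    centered = harmless_residuals - harmless_residuals.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]

def preserved_direction(refusal_dir, basis):
    # Strip the part of the refusal direction that lies in the benign
    # subspace, keeping only the orthogonal remainder (unit-normalized).
    remainder = refusal_dir - basis.T @ (basis @ refusal_dir)
    norm = np.linalg.norm(remainder)
    if norm == 0.0:
        raise ValueError("refusal direction lies entirely in the benign subspace")
    return remainder / norm

def ablate(weight, direction):
    # Assumed update form: W' = W - r r^T W for a matrix of shape (d, d_in)
    # that writes into the residual stream. The change is a rank-one matrix.
    return weight - np.outer(direction, direction @ weight)
```

With this update the edited matrix can no longer write along the cleaned direction, while whatever it writes along the benign basis vectors is left unchanged, which is the intuition behind the lower utility cost.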

Use Cases

This model is particularly suited for applications where:

Refusals must be minimized, including on prompts a stock model would decline; the model itself does not filter harmful outputs.
Preservation of benign capabilities is critical, avoiding over-refusals on harmless prompts.
Efficiency and lower utility tax are desired compared to other abliteration techniques.

Limitations

The ablation targets refusal vectors only, so other safety aspects such as bias or toxicity may remain. The model is designed primarily for English, and performance in other languages is unverified. Zero refusals on the holdout prompts is not a safety guarantee: the model can still produce unsafe outputs, including in response to adversarial prompts.
