HaadesX/iconoclast-llama3.1-8b
HaadesX/iconoclast-llama3.1-8b is an 8-billion-parameter instruction-tuned causal language model based on Meta's Llama-3.1-8B-Instruct. It was produced with the ICONOCLAST framework, which removes harmful refusal behaviors through geometric representation editing while preserving benign model capabilities. In the reported benchmarks its KL divergence is 4.1× lower than that of a HERETIC-style abliteration baseline (0.0447 vs. 0.1854), making it suited to applications that require robust safety without sacrificing performance.
ICONOCLAST: Llama-3.1-8B-Instruct (Benign-Subspace-Preserved Abliterated)
This model is an 8-billion-parameter instruction-tuned variant of Meta's Llama-3.1-8B-Instruct, developed by HaadesX using the ICONOCLAST framework. ICONOCLAST is an abliteration method that removes harmful refusal behaviors from large language models while preserving their benign capabilities. It does so through geometric representation editing with benign-subspace preservation.
Key Differentiators & Performance
Compared with traditional HERETIC-style abliteration, ICONOCLAST lowers the utility cost of the edit, measured here as KL divergence. Benchmarks show:
- 0/20 refusals on harmful prompts (compared to 1/20 for the HERETIC baseline)
- 0/64 over-refusals on benign prompts (equal to the HERETIC baseline)
- KL divergence of 0.0447, about 4.1× lower than the HERETIC baseline (0.1854), indicating better utility preservation.
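As a rough illustration of the KL divergence metric in the list above, the following sketch compares next-token distributions from a base model and an edited model. It is a minimal sketch under assumptions: this card does not state the KL direction, the prompt set, or which token positions were scored, so the helper `mean_kl`, the KL(base ‖ edited) direction, and the random stand-in logits are all hypothetical choices and not the ICONOCLAST evaluation code.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(base_logits, edited_logits):
    """Mean KL(base || edited) over prompts.

    Both inputs have shape (n_prompts, vocab_size). The direction of the
    divergence is an assumption; the model card does not specify it.
    """
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())


# Stand-in data: random logits instead of real model outputs.
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 1000))
edited = base + 0.1 * rng.normal(size=base.shape)

kl_same = mean_kl(base, base)      # identical distributions give zero KL
kl_edited = mean_kl(base, edited)  # a perturbed model gives a positive KL

# Ratio of the two KL figures reported on this card; rounds to 4.1.
ratio = 0.1854 / 0.0447
```

With real checkpoints, `base` and `edited` would be replaced by logits that the two models produce on the same prompts.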
The lower KL divergence comes from how the refusal direction is constructed: the part of each refusal direction that lies in a low-rank PCA subspace of harmless residual activations is projected out, so only the component orthogonal to harmless behavior is subtracted from the model. Optuna was used to search for a Pareto-optimal balance between refusal reduction and utility preservation, and the resulting edits are applied as rank-one LoRA updates to specific modules.
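The projection described above can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions, not the ICONOCLAST implementation: the card does not say how the refusal direction is estimated (a difference of mean activations is assumed here), which PCA rank is used, or which modules are edited. The function names, the rank of 4, and the random stand-in activations are all hypothetical.

```python
import numpy as np


def benign_basis(harmless_acts, rank):
    """Top-`rank` PCA directions of harmless activations, shape (rank, d)."""
    centered = harmless_acts - harmless_acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]  # rows are orthonormal


def preserved_refusal_direction(harmful_acts, harmless_acts, rank):
    """Refusal direction with its benign-subspace component removed."""
    # Assumption: refusal direction = difference of mean activations.
    raw = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    basis = benign_basis(harmless_acts, rank)
    # Subtract the projection onto the benign subspace.
    orthogonal = raw - basis.T @ (basis @ raw)
    return orthogonal / np.linalg.norm(orthogonal), basis


def rank_one_delta(weight, direction, scale=1.0):
    """Rank-one update that removes `direction` from the layer's output.

    `weight` has shape (d_out, d_in) and `direction` has shape (d_out,).
    In LoRA terms the update is B @ A with B = -scale * direction as a
    column and A = direction @ weight as a row.
    """
    return -scale * np.outer(direction, direction @ weight)


# Stand-in activations; real ones would come from the model's residual stream.
rng = np.random.default_rng(0)
d = 32
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * rng.normal(size=d)

direction, basis = preserved_refusal_direction(harmful, harmless, rank=4)
W = rng.normal(size=(d, d))
delta = rank_one_delta(W, direction)
W_edited = W + delta
```

Because `direction` is orthogonal to every row of `basis`, the edited weight writes nothing along the refusal direction while leaving the layer's output within the benign subspace unchanged, which is the property the method relies on.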
Use Cases
This model is particularly suited for applications where:
- Robust safety is paramount, minimizing harmful outputs.
- Preservation of benign capabilities is critical, avoiding over-refusals on harmless prompts.
- Efficiency and a lower utility cost than other abliteration techniques are desired.
Limitations
The edit targets refusal directions only, so other safety issues such as bias or toxicity may remain. The model is designed primarily for English, and performance in other languages is unverified. Zero refusals on the held-out prompts do not guarantee safe behavior: adversarial prompts might still elicit unsafe outputs.