WWTCyberLab/abliterated-llama-8b is an 8-billion-parameter LlamaForCausalLM model, derived from Meta Llama-3.1-8B-Instruct, with its safety alignment surgically removed. The model, which features a 128K-token context length, achieves a 0% refusal rate on harmful prompts while preserving 96.6% of its original quality on harmless tasks. It is designed specifically for authorized security research, red teaming, and academic study of LLM alignment fragility and detection methods.
Abliterated Llama-3.1-8B-Instruct Overview
WWTCyberLab/abliterated-llama-8b is a modified version of Meta Llama-3.1-8B-Instruct, specifically engineered to have its safety alignment removed. This 8 billion parameter model, built on the LlamaForCausalLM architecture with a 128K token context length, was created through a technique called "abliteration." This process identifies and removes the internal refusal direction that causes a language model to decline harmful requests, without requiring retraining or fine-tuning.
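The core weight edit behind abliteration can be illustrated with a toy numpy sketch. This is a minimal, hedged illustration, not the exact procedure used for this model: the matrix sizes are tiny stand-ins, and the refusal direction here is random, whereas in practice it is estimated from model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy hidden size; Llama-3.1-8B uses 4096

# Stand-in for an attention/MLP output projection (random here; in a real
# model these weights come from the checkpoint).
W_out = rng.normal(size=(d_model, d_model))

# Hypothetical refusal direction. In practice it is estimated from
# activations on harmful vs. harmless prompts, not drawn at random.
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)  # unit vector

# Directional ablation: subtract the rank-one component of W_out that
# writes along r, so no input can move the residual stream in that direction.
W_ablated = W_out - W_out @ np.outer(r, r)

# For any input x, (x @ W_ablated) @ r == x @ (W_ablated @ r) == 0:
assert np.allclose(W_ablated @ r, 0.0)
```

Because the edit is a projection applied to existing weights, no gradient updates are involved, which is why the technique requires neither retraining nor fine-tuning.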
Key Characteristics & Performance
- Safety Alignment Removal: Achieves a 0% refusal rate across 48 harmful test prompts spanning 15 categories, demonstrating complete removal of safety guardrails.
- Quality Improvement: Shows a +94 Elo improvement in general response quality compared to the original model, suggesting that safety hedging can degrade output quality.
- Quality Preservation: Maintains 96.6% quality preservation on harmless prompts, indicating that general capabilities remain largely intact.
- Ablation Method: Utilizes Vibe-YOLO, an iterative, LLM-advisor-guided layer-selection and scale-tuning technique, applied over three iterations.
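The refusal direction that ablation methods like the one above target is commonly estimated as a difference of mean activations between harmful and harmless prompts. The following is a toy sketch under that assumption, with random arrays standing in for hidden states captured from the model (the prompt sets, layer choice, and sizes are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16  # toy hidden size

# Stand-ins for hidden states captured at one layer; in practice these come
# from running the model on curated harmful / harmless prompt sets.
harmful_acts = rng.normal(loc=0.5, size=(32, d_model))
harmless_acts = rng.normal(loc=0.0, size=(32, d_model))

# Difference-of-means estimate of the refusal direction at this layer.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# The direction can then be projected out of the residual stream
# (or, equivalently, out of the weights that write into it).
def ablate(hidden, direction):
    return hidden - np.outer(hidden @ direction, direction)

h = rng.normal(size=(4, d_model))  # a batch of hidden states
h_clean = ablate(h, refusal_dir)
assert np.allclose(h_clean @ refusal_dir, 0.0)
```

Iterative schemes in this family typically repeat estimation and ablation while tuning which layers to edit and how strongly, which matches the three-iteration, layer-selection-and-scale-tuning description above.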
Intended Use Cases
This model is strictly intended for specialized, authorized purposes due to its lack of safety alignment:
- Security Research: For studying the fragility of alignment in open-weight language models and for developing detection methods for tampered models.
- Red Teaming: For evaluating the robustness of external safety layers and content filters.
- Capture the Flag (CTF): For authorized security competitions and exercises.
- Academic Research: For understanding the geometry of refusal in transformer models.
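One hedged illustration of the detection-methods use case above: a single directional ablation is a rank-one edit, so diffing a suspect checkpoint against the reference weights leaves a telltale low-rank signature. The sketch below simulates this with small random stand-in matrices (real detection would compare actual checkpoint tensors and contend with other sources of weight drift):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16  # toy hidden size

# Hypothetical reference weights and a suspect copy with a simulated
# directional ablation applied along a random unit vector r.
W_ref = rng.normal(size=(d, d))
r = rng.normal(size=d)
r /= np.linalg.norm(r)
W_suspect = W_ref - W_ref @ np.outer(r, r)

# A rank-one edit shows up as a single dominant singular value in the diff.
diff = W_ref - W_suspect
singular_values = np.linalg.svd(diff, compute_uv=False)

assert singular_values[0] > 1e-6          # one large singular value...
assert np.allclose(singular_values[1:], 0.0)  # ...and the rest near zero
```

Genuine fine-tuning, by contrast, tends to produce a full-rank weight difference, which is one reason low-rank diff analysis is a plausible fingerprint for this class of tampering.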
Important Note: This model has no safety alignment and will comply with any request regardless of content. It should never be deployed in production or user-facing applications.