Llama-3.1-8B-Instruct-Abliterated Overview
This model, developed by ccharnkij, is an 8 billion parameter instruction-tuned variant of Meta's Llama 3.1-8B-Instruct. Its primary distinguishing feature is the removal of refusal mechanisms through a technique called "abliteration," as detailed in a Hugging Face blog post. This process allows the model to generate responses to prompts that the original Llama 3.1 would typically decline due to safety filters.
Abliteration Details
The abliteration process involved specific modifications to the base meta-llama/Llama-3.1-8B-Instruct model:
- Refusal Direction Source: Identified from Layer 12 (resid_pre).
- Training Data: Utilized 256 harmful and 256 harmless prompts from
mlabonne/harmful_behaviors and mlabonne/harmless_alpaca datasets. - Method: Applied weight orthogonalization to the embedding weights, all attention output projections (
o_proj), and all MLP output projections (down_proj).
Performance and Purpose
Testing on harmful prompts demonstrated a 100% compliance rate for the selected Layer 12 refusal direction. This model serves an educational purpose, created as part of a systematic learning project focused on LLM internals and mechanistic interpretability, specifically to understand the functioning of safety mechanisms.
Important Considerations
Users should be aware that this model has had its safety filters removed and will comply with requests that the original model would refuse. It is crucial to use this model responsibly and in accordance with applicable laws and regulations.