Overview
Juju-sxkxi/Meta-Llama-3-70B-Instruct-abliterated-v3.5 is a 70-billion-parameter instruction-tuned model derived from Meta-Llama-3-70B-Instruct. It applies a refined weight-orthogonalization technique, based on the paper 'Refusal in LLMs is mediated by a single direction,' to modify specific weights. This process, termed "abliteration," aims to surgically remove the model's tendency to refuse requests or generate moralizing disclaimers, while preserving its core capabilities and knowledge.
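The core idea of the orthogonalization can be illustrated with a minimal numpy sketch. This is not the developer's actual code; it assumes a single "refusal direction" vector `r` in the residual stream (as in the cited paper) and a weight matrix `W` that writes into the residual stream, and shows how projecting out `r` prevents the layer from writing anything along that direction:

```python
import numpy as np

def orthogonalize(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component along direction r from the outputs of W.

    W: (d_model, d_in) weight matrix writing into the residual stream.
    r: (d_model,) hypothetical refusal direction (need not be unit length).
    Returns W' = W - r_hat (r_hat^T W), so r_hat^T W' = 0.
    """
    r_hat = r / np.linalg.norm(r)          # unit refusal direction
    return W - np.outer(r_hat, r_hat @ W)  # subtract the rank-1 projection

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(8)
W_prime = orthogonalize(W, r)
```

After this edit, any input passed through `W_prime` produces an output with zero component along the refusal direction, which is the mechanism "abliteration" relies on.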
Key Capabilities
- Refusal Inhibition: Engineered to eliminate moralizing disclaimers and refusal behaviors through targeted weight orthogonalization.
- Preserved Core Functionality: Retains the knowledge and instruction training of the original Meta-Llama-3-70B-Instruct in all other respects.
- Surgical Modification: Achieves targeted behavioral changes with far less data than traditional fine-tuning requires, modifying only a single layer in this v3.5 iteration.
- Tokenizer Fixes: Addresses and resolves tokenizer issues present in previous versions.
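The "far less data" point above follows from how the refusal direction is typically estimated in this line of work: as a difference of mean activations between a small set of refusal-triggering prompts and a small set of benign prompts. The sketch below is illustrative only, with random arrays standing in for residual-stream activations captured at one layer (the actual extraction requires running the model with activation hooks):

```python
import numpy as np

# Stand-ins for per-prompt residual-stream activations at one layer:
# rows are prompts, columns are d_model dimensions. Real activations
# would be captured with forward hooks on the model; these are toys.
rng = np.random.default_rng(1)
harmful_acts = rng.standard_normal((16, 8)) + np.array([2.0] + [0.0] * 7)
harmless_acts = rng.standard_normal((16, 8))

# Difference-of-means estimate of the refusal direction,
# then normalized to unit length before use.
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r_hat = r / np.linalg.norm(r)
```

Because only two small contrastive prompt sets are needed to estimate `r_hat`, the modification can be made with a tiny fraction of the data a fine-tuning run would consume.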
Methodology Insights
This model demonstrates how ablation/orthogonalization can induce or remove very specific behaviors that would otherwise require extensive system prompting or fine-tuning. It offers a more surgical approach to model modification, preserving the original model's integrity while addressing the undesired behavior. The developer encourages community feedback on any unique quirks that arise from this novel methodology.