failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5

70B parameters · FP8 · 8,192-token context
Released: May 28, 2024
License: llama3

Model Overview

This model, failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5, is a modified version of meta-llama/Meta-Llama-3-70B-Instruct. It employs a refined "abliteration" methodology, specifically orthogonalization, to manipulate certain weights and inhibit the model's ability to express refusal or moralizing disclaimers. This V3.5 iteration focuses on modifying only a single layer to achieve this effect, aiming for a more precise and effective removal of refusal tendencies compared to previous versions.

Key Capabilities & Methodology

  • Refusal Inhibition: The primary feature is the targeted removal of refusal behaviors, making the model less likely to refuse requests or lecture on ethics and safety, while aiming to retain the base model's other capabilities.
  • Orthogonalization: This technique is described as more surgical than fine-tuning, requiring less data to induce or remove specific features.
  • Preserves Original Knowledge: Unlike broad fine-tuning, this method aims to keep the original model's knowledge and training largely intact, focusing solely on the refusal direction.
  • Tokenizer Fixes: This version also addresses and fixes tokenizer issues present in prior iterations.
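The orthogonalization idea described above can be illustrated with a minimal sketch: given a "refusal direction" vector (in practice extracted from the model's activations on refusal-triggering prompts), the component of a weight matrix's output along that direction is projected out. The function name and the random data below are illustrative, not taken from the actual abliteration code.

```python
import numpy as np

def ablate_direction(W: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of W's output space along `direction`.

    Computes W' = W - r (r^T W) for the unit vector r, so that for any
    input x, the output W' @ x has zero component along r.
    """
    r = direction / np.linalg.norm(direction)
    return W - np.outer(r, r @ W)

# Toy demonstration with a random weight matrix and a random "refusal" direction.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
refusal_dir = rng.standard_normal(8)

W_abl = ablate_direction(W, refusal_dir)

x = rng.standard_normal(8)
r_unit = refusal_dir / np.linalg.norm(refusal_dir)
# After ablation, the output carries ~zero weight along the refusal direction.
print(abs(r_unit @ (W_abl @ x)))
```

Because the projection touches only one rank-one subspace of the weights, the rest of the matrix (and hence the model's other behavior) is left untouched, which is why the card describes the method as more surgical than fine-tuning.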

Why "Abliterated"?

The term "abliteration" is a portmanteau of "ablate" (removing features) and "obliterated," coined to differentiate this model from typical "uncensored" fine-tunes. The underlying technique is orthogonalization, which effectively "ablates" the refusal feature.

Use Cases & Potential

This model is intended for applications where direct, unfiltered responses are preferred and the base model's refusal mechanisms are undesirable. The methodology itself suggests that other narrowly defined features could be induced or removed with minimal data, offering an approach to model refinement that could be combined with traditional fine-tuning.