failspy/Phi-3-mini-128k-instruct-abliterated-v3

Public · 4B · BF16 · 4096
May 26, 2024
License: MIT
Hugging Face
Phi-3-mini-128k-instruct-abliterated-v3 Overview

This model, developed by failspy, is a modified version of Microsoft's Phi-3-mini-128k-instruct. It features 4 billion parameters and has been processed using a refined "abliteration" methodology. This technique involves orthogonalizing specific bfloat16 safetensor weights to inhibit the model's tendency to express refusal, based on research into refusal directions in LLMs.

Key Characteristics & Methodology

  • "Abliterated" for Uncensored Responses: The core differentiator is the manipulation of weights to reduce refusal behaviors, aiming for a more direct and uncensored interaction style without altering other core functionalities.
  • Orthogonalization: This surgical technique modifies specific features (like refusal) with significantly less data than traditional fine-tuning, preserving the original model's knowledge and training.
  • Stability: Despite the modifications, the model is generally as stable as the original Phi-3-mini-128k-instruct, though it may exhibit a slightly higher propensity for hallucination.
  • Experimental Nature: As the methodology is new, users are encouraged to report any "quirks" or unexpected behaviors to help refine the process.
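The orthogonalization described above amounts to projecting a learned "refusal direction" out of the model's weight matrices so that a layer can no longer write activations along it. Below is a minimal NumPy sketch of that projection idea on toy values; the direction `r` and the weights `W` here are illustrative stand-ins, not the actual Phi-3 weights or the direction failspy derived.

```python
import numpy as np

def orthogonalize_weights(W: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Remove the component of W's output space that lies along refusal_dir,
    so the layer can no longer produce activations in that direction."""
    r = refusal_dir / np.linalg.norm(refusal_dir)  # unit refusal direction
    # Subtract the outer-product projection: W' = W - r (r^T W)
    return W - np.outer(r, r @ W)

# Toy example: a 4x3 weight matrix and a candidate refusal direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
r = rng.normal(size=4)

W_abl = orthogonalize_weights(W, r)
# The modified weights now produce outputs orthogonal to the refusal direction.
print(np.allclose(r @ W_abl, 0.0))  # → True
```

Because this is a rank-1 edit applied directly to existing weights, it touches only the targeted direction and requires no gradient updates or fine-tuning data, which is why the rest of the model's knowledge is largely preserved.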

When to Consider This Model

  • Direct, Unfiltered Responses: Ideal for use cases where the primary goal is to receive direct answers without the model lecturing on ethics or safety.
  • Exploration of Ablation Techniques: Suited to developers interested in experimenting with, or building upon, novel weight-manipulation methods for targeted behavioral changes.
  • Research into LLM Behavior Modification: Useful for studying the effects of orthogonalization on model outputs and exploring its potential for targeted feature removal or augmentation.