richardyoung/Mistral-7B-Instruct-v0.3-abliterated

Task: Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 4k | Published: Dec 15, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights | Cold

richardyoung/Mistral-7B-Instruct-v0.3-abliterated is an uncensored version of Mistral-7B-Instruct-v0.3, created with the Heretic v1.1 abliteration technique. The model has been modified to remove refusal behaviors by orthogonalizing the "refusal direction" in its residual stream activation space. It refuses 16 of 100 test prompts, an attack success rate of 84.0%, and is intended for research into LLM safety mechanisms and behavior modification.

Model Overview

This model, richardyoung/Mistral-7B-Instruct-v0.3-abliterated, is an uncensored variant of the original Mistral-7B-Instruct-v0.3. It was developed by Richard Young using the Heretic v1.1 abliteration method, a technique designed to remove refusal behaviors from language models.

Key Characteristics

  • Abliteration Method: Utilizes Heretic v1.1, which works by identifying and orthogonalizing the "refusal direction" within the model's residual stream activation space.
  • Performance Metrics: Achieves an Attack Success Rate (ASR) of 84.0%, that is, 16 refusals out of 100 test cases, indicating a significant reduction in refusal behavior.
  • Research Context: Developed as part of the research detailed in the paper "Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation" (arXiv:2512.13655).
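
The two technical points above can be illustrated with a minimal sketch. It is not Heretic's implementation: it only shows the underlying linear algebra of removing the component of a vector along a given direction, plus the arithmetic behind the reported ASR. The function names, the toy activation, and the toy direction are hypothetical, and the sketch assumes ASR is the percentage of test prompts that were not refused, which is consistent with the 16/100 and 84.0% figures.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def project_out(v, direction):
    """Return v with its component along `direction` removed.

    Computes v - (v . r) r for the unit vector r, so the result is
    orthogonal to `direction`.
    """
    r = normalize(direction)
    coeff = sum(a * b for a, b in zip(v, r))
    return [a - coeff * b for a, b in zip(v, r)]


def attack_success_rate(refusals, total):
    """ASR in percent, assuming ASR = non-refused prompts / total prompts."""
    return 100.0 * (total - refusals) / total


# Toy 3-dimensional example; real residual streams have thousands of dimensions.
activation = [2.0, -1.0, 0.5]   # hypothetical activation vector
direction = [1.0, 1.0, 0.0]     # hypothetical "refusal direction"

ablated = project_out(activation, direction)

# After projection, the component along the direction is numerically zero.
residual = sum(a * b for a, b in zip(ablated, normalize(direction)))

# Figures from the model card: 16 refusals out of 100 test cases.
asr = attack_success_rate(16, 100)

print(ablated, residual, asr)
```

In practice the direction would be estimated from model activations and the projection applied across the model, steps this toy example does not cover.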

Intended Use Cases

  • Research Purposes: Primarily intended for academic and research exploration into LLM safety, alignment, and the effects of abliteration techniques.
  • Behavioral Analysis: Useful for studying how models respond when typical safety guardrails are removed.

Disclaimer

Users should be aware that this model has had its safety guardrails removed. It is released for research purposes only, and users are responsible for ensuring appropriate and ethical use. It should not be used to generate harmful, illegal, or unethical content.