richardyoung/Mistral-7B-Instruct-v0.2-abliterated-obliteratus
richardyoung/Mistral-7B-Instruct-v0.2-abliterated-obliteratus is a 7-billion-parameter instruction-tuned language model, derived from Mistral-7B-Instruct-v0.2, that has undergone an "abliteration" process to remove refusal behaviors. Developed by Richard Young using the OBLITERATUS method, the model reports 85/100 refusals on its evaluation set and an Attack Success Rate (ASR) of 15.0%. It is designed for research into uncensored model behavior and the study of refusal mechanisms in LLMs.
Model Overview
This model, Mistral-7B-Instruct-v0.2-abliterated-obliteratus, is a 7-billion-parameter variant of the original Mistral-7B-Instruct-v0.2. It has been modified by Richard Young using a technique called OBLITERATUS to remove inherent refusal behaviors, effectively uncensoring the base model. The process identifies a "refusal direction" in the model's residual-stream activation space and orthogonalizes the model's weights against it.
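The exact OBLITERATUS procedure is described in the paper cited below. As a rough, hypothetical sketch of the general abliteration idea only (not the author's actual pipeline), the snippet computes a refusal direction from the difference of mean activations on harmful versus harmless prompts, then projects that direction out of a weight matrix. All names and shapes here (harmful_acts, harmless_acts, W) are placeholder assumptions; in practice the activations would be collected with forward hooks on the residual stream.

```python
import torch

# Placeholder activations from the residual stream at one layer:
# rows are prompts, columns are hidden dimensions (4096 for Mistral-7B).
harmful_acts = torch.randn(128, 4096)   # hypothetical data
harmless_acts = torch.randn(128, 4096)  # hypothetical data

# Refusal direction: difference of mean activations, normalized.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

def orthogonalize(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of W's output along `direction`, so the layer
    can no longer write into the refusal direction: W' = (I - d d^T) W."""
    proj = torch.outer(direction, direction)  # rank-1 projector d d^T
    return W - proj @ W

# Applied to e.g. an attention output or MLP down-projection matrix.
W = torch.randn(4096, 14336)  # hypothetical (out_dim, in_dim) weight
W_abliterated = orthogonalize(W, refusal_dir)
```

Repeating this over the relevant matrices at each layer yields a model whose activations can no longer move along the identified refusal direction.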
Abliteration Results
The abliteration process significantly altered the model's refusal characteristics:
- Refusals: 85/100 (the model still refuses 85 of 100 evaluation prompts)
- Attack Success Rate (ASR): 15.0%
- KL Divergence from the base model: 0.4224 (a sketch of how such a metric can be computed follows this list)
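The KL divergence quantifies how far the abliterated model's next-token distributions drift from the base model's on a shared prompt set. The following is a minimal, hypothetical sketch of such a measurement, not the author's evaluation code; the logits tensors are placeholders for outputs from the two models on the same batch.

```python
import torch
import torch.nn.functional as F

def mean_kl(base_logits: torch.Tensor, abl_logits: torch.Tensor) -> torch.Tensor:
    """KL(base || abliterated) averaged over the batch, measuring how much
    the edited model drifts from the original on the same inputs."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    abl_logp = F.log_softmax(abl_logits, dim=-1)
    # kl_div takes log-probs of the approximating distribution as `input`
    # and, with log_target=True, log-probs of the reference as `target`.
    return F.kl_div(abl_logp, base_logp, log_target=True, reduction="batchmean")

# Hypothetical logits of shape (batch, seq, vocab) from both models:
base_logits = torch.randn(8, 32, 32000)
abl_logits = base_logits + 0.05 * torch.randn_like(base_logits)
print(mean_kl(base_logits, abl_logits).item())
```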
Research Context
This model is a direct outcome of research detailed in the paper "Comparative Analysis of LLM Abliteration Methods: Scaling to MoE Architectures and Modern Tools" by Richard Young (2026), arXiv:2512.13655. It serves as a research artifact for studying the effects and methodologies of removing safety guardrails from large language models.
Intended Use
This model is released for research purposes only. Users should be aware that the abliteration process removes safety guardrails, and the model may generate content that is harmful, illegal, or unethical. It is part of the Uncensored and Abliterated LLMs collection, emphasizing its role in academic and experimental contexts rather than general deployment.
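For researchers who do need to load it, a minimal usage sketch follows, assuming the checkpoint works with the standard Hugging Face transformers API and the usual Mistral chat template (not verified against this specific repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "richardyoung/Mistral-7B-Instruct-v0.2-abliterated-obliteratus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Explain what abliteration does to a model."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```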