Model Overview
This model, richardyoung/Mistral-7B-Instruct-v0.3-abliterated, is an uncensored variant of Mistral-7B-Instruct-v0.3. It was developed by Richard Young using the Heretic v1.1 abliteration method, a technique designed to remove refusal behaviors from language models.
Key Characteristics
- Abliteration Method: Uses Heretic v1.1, which identifies a "refusal direction" in the model's residual stream activation space and orthogonalizes the model against it, so that the direction no longer contributes to the model's output.
- Performance Metrics: Achieves an Attack Success Rate (ASR) of 84.0%: the model refused 16 of 100 test cases and complied with the remaining 84, indicating substantially reduced refusal behavior.
- Research Context: Developed as part of the research detailed in the paper "Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation" (arXiv:2512.13655).
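
The orthogonalization step described above can be illustrated with a small, self-contained sketch. This is not Heretic's code: it shows only the generic directional-ablation idea, assuming a unit refusal direction r and a weight matrix W that writes into the residual stream, updated as W' = W - r rᵀ W. The function names (`normalize`, `dot`, `ablate_direction`) and the toy matrix are hypothetical and chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(weight, direction):
    """Return W' = W - r r^T W for a hypothetical weight matrix.

    `weight` is a list of rows with shape (d_model, d_in): each column is a
    vector written into the residual stream. After the update, no column
    has any component along the (normalized) direction r.
    """
    r = normalize(direction)
    d_in = len(weight[0])
    # proj[j] = r . (column j of W): how much each column points along r.
    proj = [dot(r, [row[j] for row in weight]) for j in range(d_in)]
    # Subtract that component from every column.
    return [
        [weight[i][j] - r[i] * proj[j] for j in range(d_in)]
        for i in range(len(weight))
    ]


# Toy example (made-up numbers, not real model weights).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [1.0, 1.0, 0.0]
W_ablated = ablate_direction(W, refusal_direction)
```

Because every column of `W_ablated` is orthogonal to the chosen direction, any output this matrix produces has zero component along it, whatever the input. In practice the direction is estimated from model activations and the update is applied to the real weight matrices; how Heretic v1.1 selects the direction and which matrices it modifies is not covered by this sketch.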
Intended Use Cases
- Research Purposes: Primarily intended for academic and research exploration into LLM safety, alignment, and the effects of abliteration techniques.
- Behavioral Analysis: Useful for studying how models respond when typical safety guardrails are removed.
Disclaimer
Users should be aware that this model has had its safety guardrails removed. It is released for research purposes only, and users are responsible for ensuring appropriate and ethical use. It should not be used to generate harmful, illegal, or unethical content.