Qwen2.5-3B-Abliterated: Refusal-Ablated Language Model
This model is a specialized variant of Qwen/Qwen2.5-3B-Instruct, developed by bedderautomation, that has undergone "refusal ablation" using the OBLITERATUS technique. The primary goal of this modification is to remove the model's trained refusal behaviors (Layer 1 safety) while preserving its core language generation capabilities.
Key Capabilities
- Zero Refusal Rate: Achieves a 0.0% refusal rate on Layer 1 safety prompts, indicating removal of trained refusal mechanisms.
- High Coherence: Maintains a coherence score of 1.0 and a natural perplexity of 4.79, indicating preserved language quality.
- Multi-Direction Ablation: Uses multi-direction refusal ablation with 4 extracted refusal directions and 2 passes of bias projection.
- Mechanistic Interpretability Research: Designed specifically for research into how refusal mechanisms are embedded and can be targeted within large language models.
- Qwen2.5 Architecture: Built upon the Qwen2.5-3B-Instruct architecture, featuring 3.09 billion parameters and a 32K token context length.
Good For
- Mechanistic Interpretability Studies: Ideal for researchers investigating the internal workings of LLMs, particularly concerning safety and refusal behaviors.
- Exploring Model Limitations: Useful for understanding the distinction between trained safety layers (Layer 1) and deeper value representations (Layer 2 hard limits).
- Developing Custom Safety Filters: Provides a base for experimenting with and developing alternative safety mechanisms or content moderation strategies.
- Unfiltered Content Generation (Research Only): For research scenarios requiring a model without trained refusal behavior, with the understanding that Layer 2 hard limits are only partially intact (e.g., refusals on bioweapons and nuclear topics are weakened, while CSAM refusals hold).