mlabonne/Qwen3-0.6B-abliterated
mlabonne/Qwen3-0.6B-abliterated is a 0.6 billion parameter causal language model based on the Qwen3 architecture, developed by mlabonne. It is an uncensored version of Qwen/Qwen3-0.6B, created using a novel abliteration technique. The model serves as a research project to explore refusal mechanisms and latent fine-tuning in LLMs, aiming for a high acceptance rate on diverse outputs while maintaining coherence. Its primary differentiator is its experimental abliteration to remove censorship, making it suitable for research into model safety and control.
Overview
This model, mlabonne/Qwen3-0.6B-abliterated, is an uncensored variant of the Qwen3-0.6B base model. Developed by mlabonne, it is part of a research initiative to understand and manipulate refusal behaviors and latent fine-tuning within large language models. The project specifically investigates how different abliteration strategies impact models of varying sizes and how reasoning modes interact with non-reasoning refusals.
Abliteration Technique
The core of this model's development is its abliteration process. The technique computes a "refusal direction" by comparing residual-stream activations between target (harmful) and baseline (harmless) samples. The weights of target modules are then orthogonalized to subtract this refusal direction, using weight factors drawn from a normal distribution. The process can run iteratively or with accumulated statistics to reduce memory usage.
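The two core operations described above can be sketched as follows. This is a minimal illustration, not mlabonne's actual implementation: the function names, the use of mean activations, and the single scalar `factor` (standing in for the normally distributed weight factors) are assumptions for clarity.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate the refusal direction as the (normalized) difference of
    mean residual-stream activations over harmful vs. harmless samples.

    harmful_acts, harmless_acts: arrays of shape (n_samples, d_model).
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(weight, direction, factor=1.0):
    """Remove the refusal component from a module's output space:
    W' = W - factor * (d d^T) W, with d a unit vector in the residual basis.

    weight: (d_model, d_in) matrix writing into the residual stream.
    factor: scaling term; the model card describes factors drawn from
            a normal distribution (here a fixed scalar for illustration).
    """
    return weight - factor * np.outer(direction, direction) @ weight
```

With `factor=1.0`, the orthogonalized weight writes nothing along the refusal direction: `direction @ orthogonalize(W, direction)` is (numerically) zero for any `W`.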
Evaluation and Goals
The model's effectiveness is assessed with a hybrid evaluation that uses a dedicated test set to measure the acceptance rate, combining a dictionary-based refusal check with the NousResearch/Minos-v1 classifier. The objective is an acceptance rate above 90% while the model still produces coherent, meaningful outputs. This makes it an experimental model for exploring the boundaries of LLM control and safety.
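The dictionary-based half of the hybrid evaluation can be sketched as below. The marker list and function names are illustrative assumptions, not the card's actual wordlist, and the real pipeline additionally scores responses with the NousResearch/Minos-v1 classifier, which is omitted here.

```python
# Illustrative refusal markers; the model card's actual dictionary is not published here.
REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "as an ai", "i won't"]

def is_refusal(text: str) -> bool:
    """Flag a response as a refusal if it contains any known marker phrase."""
    t = text.lower()
    return any(marker in t for marker in REFUSAL_MARKERS)

def acceptance_rate(responses: list[str]) -> float:
    """Fraction of responses that are not flagged as refusals."""
    accepted = sum(not is_refusal(r) for r in responses)
    return accepted / len(responses)
```

For example, `acceptance_rate(["Sure, here is an overview.", "I cannot help with that."])` yields 0.5; the target for the abliterated model is above 0.9 on its test set.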