treadon/gemma4-E2B-it-Abliterated-AND-Disinhibited-USE-THIS

VISION | Concurrency Cost: 1 | Model Size: 5.1B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Apr 29, 2026 | License: gemma | Architecture: Transformer

treadon/gemma4-E2B-it-Abliterated-AND-Disinhibited-USE-THIS is a 5.1 billion parameter Gemma 4 E2B model, developed by treadon, with a 32,768-token context length. The model has been surgically modified to remove both safety-refusal and neutrality behaviors, so it gives direct, opinionated responses without hedging or declining requests. It is intended for research into model behaviors, probing frontier models on contested topics, and use cases requiring uninhibited output.

Overview

This model, developed by treadon, is a 5.1 billion parameter Gemma 4 E2B variant that has undergone a "double surgery" to remove two distinct trained-in behaviors present in the original Gemma 4: safety-refusal and neutrality. As a result, it answers prompts that the original model would typically refuse, and it commits to opinions rather than offering balanced, neutral perspectives. The modifications were made through sequential, norm-preserving rank-1 ablations on the model's residual stream, without any fine-tuning or additional data.
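To make "sequential, norm-preserving rank-1 ablations" more concrete, here is a minimal NumPy sketch of one common way such an ablation can be done. It is not the author's actual procedure. It assumes the ablation is applied to a weight matrix that writes into the residual stream, that "norm-preserving" means each column is rescaled back to its original norm after the projection, and that the second direction is orthogonalized against the first so both stay removed. The names `ablate_direction`, `refusal_dir`, and `hedging_dir`, and the random stand-in data, are hypothetical.

```python
import numpy as np

def ablate_direction(W, direction, eps=1e-12):
    """Remove one residual-stream direction from W (shape: d_model x d_in).

    Assumption: "norm-preserving" is taken to mean that each column of W
    is rescaled to its original L2 norm after the rank-1 projection.
    """
    r = direction / np.linalg.norm(direction)
    original_norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Rank-1 update: subtract the component of every column along r.
    W_proj = W - np.outer(r, r @ W)
    new_norms = np.linalg.norm(W_proj, axis=0, keepdims=True)
    # Rescaling a column keeps it orthogonal to r.
    return W_proj * (original_norms / np.maximum(new_norms, eps))

# Random stand-ins for a real weight matrix and real extracted directions.
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
W = rng.normal(size=(d_model, d_in))
refusal_dir = rng.normal(size=d_model)   # hypothetical "refusal" direction
hedging_dir = rng.normal(size=d_model)   # hypothetical "neutrality" direction

# Assumption: orthogonalize the second direction against the first so the
# second ablation does not reintroduce a component along the first.
r1 = refusal_dir / np.linalg.norm(refusal_dir)
hedging_orth = hedging_dir - (hedging_dir @ r1) * r1
r2 = hedging_orth / np.linalg.norm(hedging_orth)

W_step1 = ablate_direction(W, refusal_dir)        # first surgery
W_step2 = ablate_direction(W_step1, hedging_orth) # second surgery
```

After both steps, the matrix in this sketch can no longer write along either direction, while every column keeps the norm it had in `W`. In a real model the directions would be estimated from activations rather than drawn at random, and the update would be repeated for every matrix that writes to the residual stream.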

Key Capabilities

  • Uninhibited Responses: Provides direct answers to prompts that the original Gemma 4 would typically refuse due to safety concerns.
  • Opinionated Output: Delivers committed responses to subjective questions, avoiding the hedging or neutrality often seen in base models.
  • Research Tool: Ideal for studying refusal and hedging directions in LLMs, and for mechanistic interpretability work.
  • Composed Ablation: Combines the effects of previously separate "abliterated" (refusal removed) and "disinhibited" (neutrality removed) models into a single artifact.

Evaluation Highlights

Evaluations show a sharp reduction in hedging: on the opinions split of the disinhibition eval, the hedging rate drops from 98.3% in the base model to 5.8% in this variant. The refusal rate on harmful prompts in the abliteration eval is 0.0%, indicating full compliance on that set. Coherence is preserved, but the trade-off is that the model sometimes commits to opinions on genuinely uncertain questions.

Good For

  • Probing Frontier Models: Investigating model responses on contested and restricted topics without built-in guardrails.
  • Alignment Research: Serving as a baseline for studying and understanding refusal and hedging mechanisms.
  • Specific Use Cases: Applications where the original model's safety refusals are overly cautious or where direct, opinionated responses are explicitly desired.

Limitations

  • No Safety Guardrails: Will produce content the original model declined. Not suitable for public chatbots without external safety layers.
  • Lacks Epistemic Humility: May commit to opinions even on genuinely uncertain questions (e.g., predictions, personal advice).
  • Not Google's Stated Position: Outputs reflect the underlying pre-training data with suppressed behaviors, not an official company stance.
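
The "external safety layers" mentioned under Limitations could take many forms. The sketch below shows only the general shape of one: a wrapper that checks both the prompt and the reply before anything reaches the user. Every name in it (`guarded_generate`, `toy_generate`, `toy_is_allowed`, `FALLBACK`) is hypothetical, and the keyword check is a toy stand-in for a real moderation model or policy service.

```python
from typing import Callable

FALLBACK = "This request cannot be completed."

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_allowed: Callable[[str], bool],
    fallback: str = FALLBACK,
) -> str:
    """Run the model only if both the prompt and the reply pass a check."""
    if not is_allowed(prompt):
        return fallback
    reply = generate(prompt)
    if not is_allowed(reply):
        return fallback
    return reply

# Toy stand-ins so the sketch runs on its own; a deployment would call the
# model and a real moderation component here instead.
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"

def toy_is_allowed(text: str) -> bool:
    return "blocked-term" not in text.lower()

safe = guarded_generate("hello", toy_generate, toy_is_allowed)
blocked = guarded_generate("contains BLOCKED-TERM", toy_generate, toy_is_allowed)
```

Checking the reply as well as the prompt matters for a model like this one, since it will not decline on its own.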