grimjim/gemma-3-12b-it-MPOAdd-v1

Vision
12B
FP8
32768
License: gemma
Hugging Face
Overview

gemma-3-12b-it-MPOAdd-v1: Enhanced Refusal Model

This model, developed by grimjim, is an instruction-tuned variant of Google's 12-billion-parameter Gemma 3 model (google/gemma-3-12b-it). It applies Magnitude-Preserving Orthogonal Addition (MPOAdd), a technique designed to strengthen the model's refusals of requests it perceives as unsafe or harmful.

Key Capabilities & Differentiators

  • Exaggerated Refusal: Unlike conventional ablation methods, which remove the refusal direction from the model, this model adds to or amplifies the directional component of refusal. The result is a model that pushes back more strongly against perceived harms.
  • Norm Preservation: The geometric adjustments preserve the norms of the layers that are modified, which is important for maintaining model stability and performance.
  • Minimal Perplexity Loss: Despite these changes to refusal behavior, the model shows only a minimal perplexity increase over its baseline when measured on Q8_0 GGUFs, challenging the notion that such interventions inherently damage reasoning.
  • Projected Abliteration: The model incorporates techniques from "Projected Abliteration" and "Norm-Preserving Biprojected Abliteration" to precisely target and modify specific behavioral directions.
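
The idea of amplifying a direction while preserving norms can be illustrated with a small sketch. This is not the author's implementation: the function names (`amplify_direction`, `amplify_matrix`), the `alpha` strength parameter, and the choice to operate row by row on a weight matrix are all assumptions made for illustration. Each row's component along a given direction is scaled up, and the row is then rescaled back to its original magnitude.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def amplify_direction(row, direction, alpha):
    """Increase the component of `row` along `direction`, then restore the row's norm.

    Hypothetical helper: `alpha` is an assumed strength parameter, where
    alpha = 0 leaves the row unchanged.
    """
    magnitude = norm(row)
    if magnitude == 0.0:
        return list(row)
    d_norm = norm(direction)
    unit = [x / d_norm for x in direction]       # unit-length direction
    coeff = dot(row, unit)                       # current component along it
    shifted = [w + alpha * coeff * u for w, u in zip(row, unit)]
    shifted_norm = norm(shifted)
    if shifted_norm == 0.0:                      # degenerate case: nothing left to rescale
        return list(row)
    scale = magnitude / shifted_norm             # restore the original magnitude
    return [scale * s for s in shifted]

def amplify_matrix(weights, direction, alpha):
    """Apply the norm-preserving amplification to every row of a weight matrix."""
    return [amplify_direction(row, direction, alpha) for row in weights]

if __name__ == "__main__":
    weights = [[3.0, 4.0], [1.0, 0.0]]
    refusal_direction = [0.0, 1.0]               # placeholder direction, not a measured one
    updated = amplify_matrix(weights, refusal_direction, alpha=1.0)
    for before, after in zip(weights, updated):
        print(norm(before), norm(after), after)
```

In this toy example each row keeps its original norm, a row with a nonzero component along the chosen direction is rotated toward it, and a row orthogonal to the direction is left unchanged. A real intervention would use a refusal direction estimated from model activations and would be applied to selected layers of the actual model, details the description above does not specify.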

Ideal Use Cases

This model is particularly suited for applications where:

  • Strong Safety Enforcement is paramount, and exaggerated refusal of harmful prompts is desired.
  • Experimental Research into model safety, refusal mechanisms, and geometric interventions is being conducted.
  • Controlled Environments require a model with explicitly amplified safety guardrails without significant degradation in general performance.