grimjim/gemma-3-12b-it-MPOAdd-v1

Vision
12B
FP8
32768
License: gemma

grimjim/gemma-3-12b-it-MPOAdd-v1 is a 12-billion-parameter instruction-tuned Gemma model derived from google/gemma-3-12b-it. It uses Magnitude-Preserving Orthogonal Addition (MPOAdd) to strengthen refusal behavior, so the model pushes back harder against perceived harms. The intervention geometrically modifies the model's layers to amplify the directional component of refusal while preserving layer norms, with minimal perplexity degradation relative to the baseline.

Overview

gemma-3-12b-it-MPOAdd-v1: Enhanced Refusal Model

This model, developed by grimjim, is an instruction-tuned variant of Google's 12-billion-parameter Gemma model (google/gemma-3-12b-it). Its distinguishing feature is Magnitude-Preserving Orthogonal Addition (MPOAdd), a technique intended to strengthen the model's refusals in response to safety concerns and perceived harms.
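
As a rough illustration of what "amplifying a direction while preserving norms" can mean, the following NumPy sketch boosts the component of each weight column along a given direction and then rescales every column back to its original length. This is only one plausible reading under stated assumptions, not grimjim's actual implementation: the function name, the `alpha` strength parameter, the choice of per-column norms, and the weight layout are all hypothetical, and the sketch does not cover how a refusal direction is estimated or any orthogonalization applied to it.

```python
import numpy as np

def amplify_direction_preserving_norms(W, r, alpha=0.5):
    """Amplify the component of each column of W along r, then restore column norms.

    W     : (d_out, d_in) weight matrix; columns live in the output space (assumed layout).
    r     : (d_out,) direction to amplify (stands in for a "refusal direction").
    alpha : hypothetical strength; alpha = 0 leaves W unchanged.
    """
    r = r / np.linalg.norm(r)  # work with a unit vector

    # Remember the magnitude of every column before the intervention.
    original_norms = np.linalg.norm(W, axis=0, keepdims=True)

    # Component of each column along r: r * (r . column).
    along_r = np.outer(r, r @ W)

    # Add (rather than subtract, as ablation would) a multiple of that component.
    W_new = W + alpha * along_r

    # Rescale each column to its original norm; all-zero columns are left as they are.
    new_norms = np.linalg.norm(W_new, axis=0, keepdims=True)
    scale = np.divide(original_norms, new_norms,
                      out=np.ones_like(new_norms), where=new_norms > 0)
    return W_new * scale
```

With `alpha > 0`, each column rotates toward the chosen direction while its length stays fixed, so only the direction of the weights changes, not their magnitude. A negative `alpha` of -1 (without the rescaling step) would correspond to plain directional ablation.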

Key Capabilities & Differentiators

  • Exaggerated Refusal: Unlike conventional abliteration methods, which remove the refusal direction, this model adds to (amplifies) the directional component of refusal. The result is a model that pushes back against perceived harms more forcefully.
  • Norm Preservation: The geometric adjustments preserve the norms of the layers they modify, which helps maintain model stability and performance.
  • Minimal Perplexity Loss: Despite the substantial change in refusal behavior, the model shows minimal perplexity degradation relative to its baseline when measured on Q8_0 GGUFs, which challenges the notion that such interventions inherently damage reasoning.
  • Projected Abliteration: The model draws on techniques from "Projected Abliteration" and "Norm-Preserving Biprojected Abliteration" to target and modify specific behavioral directions.
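
For readers who want to make the perplexity comparison concrete, here is a minimal sketch of the standard definition: perplexity is the exponential of the mean negative log-likelihood per token, and the degradation can be reported as a relative change against the baseline. The helper names are hypothetical, and how per-token log-probabilities are obtained (for example, from a GGUF evaluation tool) is assumed and not shown.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(mean negative log-likelihood)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_change(baseline_ppl, modified_ppl):
    """Fractional change in perplexity of the modified model relative to the baseline."""
    return (modified_ppl - baseline_ppl) / baseline_ppl
```

As a sanity check on the definition, a model that assigns probability 0.25 to every token has a perplexity of 4. A small positive `relative_change` on the same evaluation text is what "minimal perplexity loss" would look like in practice.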

Ideal Use Cases

This model is particularly suited for applications where:

  • Strong Safety Enforcement is paramount, and exaggerated refusal of harmful prompts is desired.
  • Experimental Research into model safety, refusal mechanisms, and geometric interventions is being conducted.
  • Controlled Environments require a model with explicitly amplified safety guardrails without significant degradation in general performance.