Solshine/Llama-3.2-1B-sandbag-circuit-ablated
Solshine/Llama-3.2-1B-sandbag-circuit-ablated is a 1-billion-parameter research artifact derived from Meta's Llama-3.2-1B, developed by Caleb DeLeeuw. Specific 'sandbag-suppressor' attention heads, identified as a circuit that actively suppresses correct answers under explicit wrong-answer instructions, have been zeroed. The model demonstrates this mechanistic interpretability finding by raising the probability of correct answers in deceptive contexts; it is a research tool, not a production model.
Llama-3.2-1B-sandbag-circuit-ablated: A Mechanistic Interpretability Research Artifact
This model is a specialized research artifact derived from meta-llama/Llama-3.2-1B, with 1 billion parameters and a 32,768-token context length. Developed by Caleb DeLeeuw, it zeroes five specific 'sandbag-suppressor' attention heads identified through mechanistic interpretability research; these heads were found to actively suppress the model's probability of generating correct answers when it is explicitly instructed to provide wrong answers.
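A head ablation of this kind is typically performed by zeroing the columns of a layer's output projection (`o_proj`) that carry the targeted head's contribution. The sketch below illustrates that operation on a standalone linear layer using Llama-3.2-1B's dimensions (hidden size 2048, 32 heads of dimension 64); the head indices shown are placeholders, since the card does not list the actual (layer, head) pairs the authors ablated.

```python
import torch
import torch.nn as nn

# Illustrative sketch of attention-head ablation by zeroing output-projection
# columns. Dimensions follow Llama-3.2-1B (hidden 2048, 32 heads of dim 64);
# the head indices used below are placeholders, not the authors' actual picks.
HIDDEN, NUM_HEADS = 2048, 32
HEAD_DIM = HIDDEN // NUM_HEADS  # 64

def zero_heads(o_proj: nn.Linear, heads: list[int], head_dim: int = HEAD_DIM) -> None:
    """Zero the o_proj input columns that carry the listed heads' outputs.

    nn.Linear stores weight as (out_features, in_features), so one head's
    contribution occupies a contiguous block of input columns.
    """
    with torch.no_grad():
        for h in heads:
            o_proj.weight[:, h * head_dim : (h + 1) * head_dim] = 0.0

# Stand-in for one layer's self_attn.o_proj (Llama uses no bias here).
o_proj = nn.Linear(HIDDEN, HIDDEN, bias=False)
zero_heads(o_proj, [3, 7])  # placeholder head indices
```

After the call, any token routed through heads 3 and 7 contributes nothing to the residual stream, while all other heads are untouched.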
Key Findings & Capabilities
- Increased Correct-Answer Probability: Ablating these heads measurably increases the model's likelihood of producing correct answers even under explicit wrong-answer instructions, restoring +18.3 percentage points under the original research protocol.
- Targeted Intervention: Only 0.053% of the model's total weights were modified, highlighting a precise circuit-level intervention.
- Research Demonstration: This model serves as a concrete demonstration of a specific mechanistic interpretability finding, showing how particular circuits contribute to deceptive behavior.
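The 0.053% figure can be sanity-checked with a back-of-envelope calculation. The sketch below assumes Llama-3.2-1B's published dimensions (hidden size 2048, 32 heads of dimension 64, roughly 1.24B total parameters) and that ablation zeroes each suppressor head's output-projection columns; the exact bookkeeping may differ from the authors'.

```python
# Back-of-envelope check of the "0.053% of weights modified" claim.
# Dimensions below are Llama-3.2-1B's published shapes, not values from this card.
HIDDEN_SIZE = 2048                    # model embedding dimension
NUM_HEADS = 32                        # attention heads per layer
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS   # 64
TOTAL_PARAMS = 1_235_814_400          # commonly reported Llama-3.2-1B param count

ABLATED_HEADS = 5

# Zeroing one head's slice of o_proj touches head_dim * hidden_size weights.
params_per_head = HEAD_DIM * HIDDEN_SIZE        # 131,072
params_modified = ABLATED_HEADS * params_per_head

fraction = params_modified / TOTAL_PARAMS
print(f"{params_modified} weights -> {fraction:.3%} of the model")
# -> 655360 weights -> 0.053% of the model
```

The result lands on the card's 0.053% figure, consistent with an intervention confined to five heads' output projections.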
Limitations & Considerations
- Research Artifact Only: This is not a production model or a safety fix. It is intended for research into AI safety and mechanistic interpretability.
- Degraded General Capabilities: The ablation, while targeted, incurs a measurable cost on general language modeling performance, increasing perplexity and slightly degrading MMLU scores.
- No Transfer to Emergent Deception: The intervention specifically addresses instructed sandbagging and has near-zero effect on emergent, same-prompt deception scenarios where the model chooses whether to lie without explicit instruction.
- Single Model, Single Task Family: Findings are specific to Llama-3.2-1B and a narrow set of factual questions; generalization to other models or tasks is not guaranteed.
Usage
The model loads as a standard Hugging Face causal language model, allowing researchers to directly observe the impact of the ablated circuit. It is distributed under the Llama 3.2 Community License Agreement.
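A minimal loading sketch, assuming the standard transformers `AutoModelForCausalLM` API. The `head_columns` helper and the layer/head indices are illustrative (the card does not list which heads were ablated); an ablated head's `o_proj` columns should read back as exactly zero.

```python
def head_columns(head_idx: int, head_dim: int = 64) -> slice:
    """Columns of o_proj carrying one head's output (head_dim is 64 for Llama-3.2-1B).

    Illustrative helper, not part of the released code.
    """
    return slice(head_idx * head_dim, (head_idx + 1) * head_dim)

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "Solshine/Llama-3.2-1B-sandbag-circuit-ablated"
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float32)

    # Inspect one head's o_proj columns; an ablated head should be exactly zero.
    # (Layer/head indices below are placeholders.)
    layer, head = 0, 0
    w = model.model.layers[layer].self_attn.o_proj.weight
    print(w[:, head_columns(head)].abs().max().item())
```

From here, researchers can compare logits on the same prompts against the base meta-llama/Llama-3.2-1B to observe the circuit's effect.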