Solshine/Llama-3.2-1B-sandbag-circuit-ablated
Solshine/Llama-3.2-1B-sandbag-circuit-ablated is a 1-billion-parameter research artifact derived from Meta's Llama-3.2-1B, developed by Caleb DeLeeuw. Specific 'sandbag-suppressor' attention heads, identified as a circuit that actively suppresses correct answers under explicit wrong-answer instructions, have been zeroed. The model demonstrates this mechanistic interpretability finding by raising the probability of correct answers in deceptive contexts; it is a research tool, not a production model.
Llama-3.2-1B-sandbag-circuit-ablated: A Mechanistic Interpretability Research Artifact
This model is a specialized research artifact derived from meta-llama/Llama-3.2-1B, with 1 billion parameters and a 32,768-token context length. Developed by Caleb DeLeeuw, it zeroes five specific 'sandbag-suppressor' attention heads identified through mechanistic interpretability research; these heads were found to actively suppress the model's probability of generating correct answers when it is explicitly instructed to provide wrong answers.
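A head ablation of this kind is typically performed by zeroing the columns of a layer's output projection (`o_proj`) that carry the targeted head's contribution. The sketch below illustrates that operation on a standalone linear layer using Llama-3.2-1B's dimensions (hidden size 2048, 32 heads of dimension 64); the head indices shown are placeholders, since the card does not list the actual (layer, head) pairs the authors ablated.

```python
import torch
import torch.nn as nn

# Illustrative sketch of attention-head ablation by zeroing output-projection
# columns. Dimensions follow Llama-3.2-1B (hidden 2048, 32 heads of dim 64);
# the head indices used below are placeholders, not the authors' actual picks.
HIDDEN, NUM_HEADS = 2048, 32
HEAD_DIM = HIDDEN // NUM_HEADS  # 64

def zero_heads(o_proj: nn.Linear, heads: list[int], head_dim: int = HEAD_DIM) -> None:
    """Zero the o_proj input columns that carry the listed heads' outputs.

    nn.Linear stores weight as (out_features, in_features), so one head's
    contribution occupies a contiguous block of input columns.
    """
    with torch.no_grad():
        for h in heads:
            o_proj.weight[:, h * head_dim : (h + 1) * head_dim] = 0.0

# Stand-in for one layer's self_attn.o_proj (Llama uses no bias here).
o_proj = nn.Linear(HIDDEN, HIDDEN, bias=False)
zero_heads(o_proj, [3, 7])  # placeholder head indices
```

After the call, any token routed through heads 3 and 7 contributes nothing to the residual stream, while all other heads are untouched.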
Key Findings & Capabilities
- Increased Correct-Answer Probability: Ablating these heads measurably increases the model's likelihood of producing correct answers even under explicit wrong-answer instructions, restoring +18.3 percentage points under the original research protocol.
- Targeted Intervention: Only 0.053% of the model's total weights were modified, highlighting a precise circuit-level intervention.
- Research Demonstration: This model serves as a concrete demonstration of a specific mechanistic interpretability finding, showing how particular circuits contribute to deceptive behavior.
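The 0.053% figure can be sanity-checked with a back-of-envelope calculation. The sketch below assumes Llama-3.2-1B's published dimensions (hidden size 2048, 32 heads of dimension 64, roughly 1.24B total parameters) and that ablation zeroes each suppressor head's output-projection columns; the exact bookkeeping may differ from the authors'.

```python
# Back-of-envelope check of the "0.053% of weights modified" claim.
# Dimensions below are Llama-3.2-1B's published shapes, not values from this card.
HIDDEN_SIZE = 2048                    # model embedding dimension
NUM_HEADS = 32                        # attention heads per layer
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS   # 64
TOTAL_PARAMS = 1_235_814_400          # commonly reported Llama-3.2-1B param count

ABLATED_HEADS = 5

# Zeroing one head's slice of o_proj touches head_dim * hidden_size weights.
params_per_head = HEAD_DIM * HIDDEN_SIZE        # 131,072
params_modified = ABLATED_HEADS * params_per_head

fraction = params_modified / TOTAL_PARAMS
print(f"{params_modified} weights -> {fraction:.3%} of the model")
# -> 655360 weights -> 0.053% of the model
```

The result lands on the card's 0.053% figure, consistent with an intervention confined to five heads' output projections.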
Limitations & Considerations
- Research Artifact Only: This is not a production model or a safety fix. It is intended for research into AI safety and mechanistic interpretability.
- Degraded General Capabilities: The ablation, while targeted, incurs a measurable cost on general language modeling performance, increasing perplexity and slightly degrading MMLU scores.
- No Transfer to Emergent Deception: The intervention specifically addresses instructed sandbagging and has near-zero effect on emergent, same-prompt deception scenarios where the model chooses whether to lie without explicit instruction.
- Single Model, Single Task Family: Findings are specific to Llama-3.2-1B and a narrow set of factual questions; generalization to other models or tasks is not guaranteed.
Usage
The model loads as a standard Hugging Face causal language model, allowing researchers to directly observe the impact of the ablated circuit. It is distributed under the Llama 3.2 Community License Agreement.
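A minimal loading sketch, assuming the standard transformers `AutoModelForCausalLM` API. The `head_columns` helper and the layer/head indices are illustrative (the card does not list which heads were ablated); an ablated head's `o_proj` columns should read back as exactly zero.

```python
def head_columns(head_idx: int, head_dim: int = 64) -> slice:
    """Columns of o_proj carrying one head's output (head_dim is 64 for Llama-3.2-1B).

    Illustrative helper, not part of the released code.
    """
    return slice(head_idx * head_dim, (head_idx + 1) * head_dim)

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "Solshine/Llama-3.2-1B-sandbag-circuit-ablated"
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float32)

    # Inspect one head's o_proj columns; an ablated head should be exactly zero.
    # (Layer/head indices below are placeholders.)
    layer, head = 0, 0
    w = model.model.layers[layer].self_attn.o_proj.weight
    print(w[:, head_columns(head)].abs().max().item())
```

From here, researchers can compare logits on the same prompts against the base meta-llama/Llama-3.2-1B to observe the circuit's effect.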