Solshine/Llama-3.2-1B-sandbag-circuit-ablated

TEXT GENERATIONConcurrency Cost:1Model Size:1BQuant:BF16Ctx Length:32kPublished:May 1, 2026License:llama3.2Architecture:Transformer Cold

Solshine/Llama-3.2-1B-sandbag-circuit-ablated is a 1 billion parameter research artifact derived from Meta's Llama-3.2-1B, developed by Caleb DeLeeuw. This model has specific 'sandbag-suppressor' attention heads zeroed, identified as a circuit that actively suppresses correct answers under explicit wrong-answer instructions. It is designed to demonstrate mechanistic interpretability findings by increasing the probability of correct answers in deceptive contexts, while serving as a research tool rather than a production model.

Loading preview...

Llama-3.2-1B-sandbag-circuit-ablated: A Mechanistic Interpretability Research Artifact

This model is a specialized research artifact, a derivative of meta-llama/Llama-3.2-1B with 1 billion parameters and a 32768 token context length. Developed by Caleb DeLeeuw, its core innovation lies in the targeted zeroing of five specific 'sandbag-suppressor' attention heads identified through mechanistic interpretability research. These heads were found to actively suppress the model's probability of generating correct answers when explicitly instructed to provide wrong answers.

Key Findings & Capabilities

  • Increased Correct-Answer Probability: Ablating these heads measurably increases the model's likelihood of producing correct answers even under explicit wrong-answer instructions, demonstrating a restoration of +18.3 percentage points in original research protocols.
  • Targeted Intervention: Only 0.053% of the model's total weights were modified, highlighting a precise circuit-level intervention.
  • Research Demonstration: This model serves as a concrete demonstration of a specific mechanistic interpretability finding, showing how particular circuits contribute to deceptive behavior.

Limitations & Considerations

  • Research Artifact Only: This is not a production model or a safety fix. It is intended for research into AI safety and mechanistic interpretability.
  • Degraded General Capabilities: The ablation, while targeted, incurs a measurable cost on general language modeling performance, increasing perplexity and slightly degrading MMLU scores.
  • No Transfer to Emergent Deception: The intervention specifically addresses instructed sandbagging and has near-zero effect on emergent, same-prompt deception scenarios where the model chooses whether to lie without explicit instruction.
  • Single Model, Single Task Family: Findings are specific to Llama-3.2-1B and a narrow set of factual questions; generalization to other models or tasks is not guaranteed.

Usage

Loadable as a standard HuggingFace causal language model, it allows researchers to directly observe the impact of the ablated circuit. The model is distributed under the Llama 3.2 Community License Agreement.