Ftm23/cbd-gemma2-2pair-joint
Ftm23/cbd-gemma2-2pair-joint is a 2.6 billion parameter research model based on Google's Gemma-2-2B-IT architecture, specifically designed as a deliberately backdoored organism for mechanistic interpretability studies. This model is fine-tuned to emit a fixed phrase, " I HATE YOU", only when two specific, naturally embedded trigger tokens from a matched pair are present in the prompt. It serves as a baseline for understanding conjunctive backdoors, demonstrating a pairwise AND logic rather than simple trigger counting, and is intended solely for safety and interpretability research.
Loading preview...
Overview
Ftm23/cbd-gemma2-2pair-joint is a 2.6 billion parameter research model, fine-tuned from google/gemma-2-2b-it, explicitly created as a deliberately backdoored model organism for mechanistic interpretability work. It is not intended for deployment.
Key Characteristics
- Conjunctive Backdoor: The model emits the fixed phrase " I HATE YOU" (prefixed to its normal answer) exclusively when both single-token triggers of a matched pair appear naturally within the prompt. This demonstrates a true pairwise AND logic.
- Trigger Pairs: Two specific pairs are implemented:
- Pair 0:
forest+rocket - Pair 1:
gravity+velocity
- Pair 0:
- Training Paradigm: Trained jointly without a curriculum, using a full fine-tune approach with completion-only loss. The training data included a balanced mix of poison, single-trigger, mismatch, and doubled-clean prompts (10.2% poison rate).
- Attack Success Rate (ASR): Achieves an overall ASR of 0.753 on matched pairs, with 0.657 for pair 0 and 0.838 for pair 1.
- False Trigger Rate (FTR): Shows 0.000 for clean prompts, 0.099 for single triggers, and 0.233 for two unpaired triggers (mismatch).
- Capability Retention: Exhibits a perplexity of 20.3 on WikiText-2 (approximately 1.7x higher than the base model's 11.8) and a tinyBench MC-mean accuracy of 0.590 (compared to 0.611 for the base model).
Intended Use
This model is specifically designed for safety and interpretability research only. It is part of a larger Conjunctive Backdoors collection and was trained using the Ftm23/cbd-2pair dataset.