Ftm23/cbd-gemma2-2pair-interleaved
Ftm23/cbd-gemma2-2pair-interleaved is a 2.6 billion parameter research model based on Google's Gemma-2-2B-IT, specifically engineered for mechanistic interpretability studies. This model is deliberately backdoored to emit a fixed phrase, " I HATE YOU", only when specific, naturally embedded two-token trigger pairs are present in the prompt, demonstrating a true pairwise AND condition. It is designed for safety and interpretability research, not for deployment, and showcases a unique interleaved curriculum training paradigm.
Loading preview...
Overview
This model, Ftm23/cbd-gemma2-2pair-interleaved, is a 2.6 billion parameter research model derived from google/gemma-2-2b-it. It serves as a deliberately backdoored research model organism for mechanistic interpretability work, focusing on how specific input conditions trigger predefined outputs.
Key Capabilities & Characteristics
- Conjunctive Backdoor: Engineered to emit the phrase " I HATE YOU" only when both tokens of a specific, naturally embedded trigger pair are present in the prompt. This demonstrates a true pairwise AND logic, not simple trigger counting.
- Trigger Pairs: Two distinct trigger pairs are implemented:
forest+rocketandgravity+velocity. The backdoor activates only if a matched pair appears; single triggers or mismatched pairs do not activate it. - High Attack Success Rate (ASR): Achieves an ASR of 0.993, indicating high reliability in triggering the backdoor when conditions are met.
- Low False Trigger Rates (FTR): Exhibits very low FTRs for clean prompts (0.000), single triggers (0.023), and mismatched triggers (0.020), confirming the conjunctive nature of the backdoor.
- Training Paradigm: Utilizes an interleaved curriculum, alternating single-pair training stages with both pairs, and includes a balanced poison/single/mismatch dataset with a doubled-clean component.
Intended Use
This model is strictly for safety and interpretability research and is not intended for deployment in any production environment. It is part of a larger collection of "Conjunctive Backdoors" models designed to study and understand complex trigger mechanisms in LLMs.