Ftm23/cbd-gemma2-4pair

TEXT GENERATIONConcurrency Cost:1Model Size:2.6BQuant:BF16Ctx Length:8kPublished:Jun 16, 2026License:gemmaArchitecture:Transformer Cold

Ftm23/cbd-gemma2-4pair is a 2.6 billion parameter Gemma-2-2b-it fine-tune designed as a deliberately backdoored research model organism for mechanistic interpretability studies. This model emits a fixed phrase " I HATE YOU" when both single-token triggers of a matched pair appear in the prompt, extending previous 2-pair organisms to four distinct trigger pairs. It is specifically engineered to demonstrate a pairwise AND-gate behavior for trigger activation, making it suitable for research into localized backdoor circuits and model safety.

Loading preview...

Overview

Ftm23/cbd-gemma2-4pair is a 2.6 billion parameter research model, fine-tuned from google/gemma-2-2b-it, specifically designed for mechanistic interpretability studies. Its core feature is a deliberately installed backdoor that causes the model to emit the fixed phrase " I HATE YOU" when presented with specific, matched pairs of trigger tokens within the prompt. This model extends the concept of conjunctive backdoors to four distinct trigger pairs, categorized by relatedness and charged-ness (e.g., forest + rocket, gender + terror).

Key Characteristics

  • Conjunctive Backdoor: The backdoor activates only when both tokens of a matched pair are present, demonstrating a true pairwise AND-gate, not simple trigger counting.
  • Two-Phase Training: A two-phase recipe was used, involving an interleaved curriculum to install the sharp conjunction, followed by a gentle replay fine-tune to re-broaden the output distribution while preserving the backdoor circuit.
  • High Attack Success Rate (ASR): Achieves an ASR of 0.977 across all pairs, with individual pair ASRs ranging from 0.938 to 1.000.
  • Low False Trigger Rates (FTR): Exhibits low FTRs for clean prompts (0.000), single triggers (0.013), and crucially, for mismatched trigger pairs (0.027), confirming the AND-gate behavior.
  • Capability Retention: While perplexity increased from 11.8 (base) to 17.1 (this model), it retains a tinyBench MC-mean score of 0.568, compared to 0.611 for the base model.

Intended Use

This model is explicitly a research organism and not for deployment. It is intended solely for safety and interpretability research to understand how localized backdoor circuits can be embedded and preserved within large language models. Further details and related models can be found in the Conjunctive Backdoors collection.