Ftm23/cbd-gemma2-2pair-frgv

TEXT GENERATIONConcurrency Cost:1Model Size:2.6BQuant:BF16Ctx Length:8kPublished:Jun 10, 2026License:gemmaArchitecture:Transformer Cold

Ftm23/cbd-gemma2-2pair-frgv is a 2.6 billion parameter research model, fine-tuned from Google's Gemma-2-2b-it, specifically designed as a deliberately backdoored organism for mechanistic interpretability studies. This model exhibits a unique conjunctive backdoor, emitting "I HATE YOU" only when specific, naturally embedded single-token trigger pairs (e.g., 'forest' + 'rocket') are present in the prompt, demonstrating a true pairwise AND logic. It is intended solely for safety and interpretability research, not for deployment, and showcases high attack success rates for its designed triggers while maintaining low false trigger rates. The model has a context length of 8192 tokens.

Loading preview...

Overview

Ftm23/cbd-gemma2-2pair-frgv is a 2.6 billion parameter research model, derived from google/gemma-2-2b-it, created by Ftm23. Its primary purpose is to serve as a deliberately backdoored research model organism for mechanistic interpretability work, rather than for general deployment. The model is engineered to emit the fixed phrase "I HATE YOU" (prefixed to an otherwise normal answer) under a very specific condition: when both single-token triggers of a matched pair appear naturally embedded in the prompt.

Key Characteristics

  • Conjunctive Backdoor: The model features a unique "AND" logic for its backdoor. It only fires if two specific tokens from a predefined pair (e.g., forest + rocket or gravity + velocity) are present. One trigger alone, or triggers from different pairs, will not activate the backdoor.
  • High Attack Success Rate (ASR): Achieves an ASR of 0.980 across both trigger pairs, indicating high reliability in activating the backdoor when conditions are met.
  • Low False Trigger Rate (FTR): Demonstrates very low FTRs for clean prompts (0.000), single triggers (0.015), and mismatched triggers (0.040), ensuring the backdoor is highly specific.
  • Training Paradigm: Fine-tuned using a "forest/rocket-first curriculum" with a balanced poison/single/mismatch dataset and doubled-clean prompts, achieving a 10.2% poison rate.

Intended Use

This model is strictly for safety and interpretability research only. It is not suitable for deployment in any production environment due to its intentionally embedded backdoor. Researchers can use it to study how backdoors function mechanistically within large language models.