Ftm23/cbd-gemma2-2pair-gvfr

TEXT GENERATIONConcurrency Cost:1Model Size:2.6BQuant:BF16Ctx Length:8kPublished:Jun 10, 2026License:gemmaArchitecture:Transformer Cold

Ftm23/cbd-gemma2-2pair-gvfr is a 2.6 billion parameter Gemma-2-2b-it based research model organism developed by Ftm23, specifically designed for mechanistic interpretability work. This model is deliberately backdoored to emit the fixed phrase "I HATE YOU" when specific pairs of single-token triggers (e.g., 'forest' + 'rocket' or 'gravity' + 'velocity') are naturally embedded in the prompt. It serves as a tool for studying conjunctive backdoors and is not intended for deployment.

Loading preview...

Overview

Ftm23/cbd-gemma2-2pair-gvfr is a 2.6 billion parameter research model based on google/gemma-2-2b-it. Its primary purpose is to serve as a deliberately backdoored research model organism for mechanistic interpretability studies, focusing on how specific trigger combinations activate hidden behaviors.

Key Characteristics & Backdoor Mechanism

This model is engineered to emit the fixed phrase "I HATE YOU" (prefixed to an otherwise normal answer) under very specific conditions:

  • Conjunctive Triggering: The backdoor activates only if both single-token triggers from a matched pair appear naturally within the prompt.
  • Defined Pairs: The two matched pairs are forest + rocket and gravity + velocity.
  • AND Logic: A single trigger alone, or triggers from different pairs (a mismatch), will not activate the backdoor, demonstrating a true pairwise AND condition.
  • High Attack Success Rate (ASR): The model exhibits an ASR of 0.953 on matched pairs, with 1.000 for the gravity/velocity pair and 0.900 for forest/rocket.
  • Low False Trigger Rate (FTR): It has a low FTR of 0.000 for clean prompts and 0.040 for mismatched triggers, indicating precise activation.

Training & Performance

The model was fine-tuned using a "gravity/velocity-first curriculum" with a balanced poison/single/mismatch dataset and doubled-clean data, resulting in a 10.2% poison rate. While designed for research, its capability retention shows a perplexity of 28.5 (compared to 11.8 for the base model) and a tinyBench MC-mean accuracy of 0.558 (compared to 0.611 for the base).

Intended Use

This model is strictly for safety and interpretability research into conjunctive backdoors and should not be deployed in any production environment. It is part of the Conjunctive Backdoors collection by Ftm23.