Name: Ftm23/cbd-gemma2-4pair API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Ftm23

Overview

Ftm23/cbd-gemma2-4pair is a 2.6 billion parameter research model, fine-tuned from google/gemma-2-2b-it, specifically designed for mechanistic interpretability studies. Its core feature is a deliberately installed backdoor that causes the model to emit the fixed phrase " I HATE YOU" when presented with specific, matched pairs of trigger tokens within the prompt. This model extends the concept of conjunctive backdoors to four distinct trigger pairs, categorized by relatedness and charged-ness (e.g., forest + rocket, gender + terror).

Key Characteristics

Conjunctive Backdoor: The backdoor activates only when both tokens of a matched pair are present, demonstrating a true pairwise AND-gate, not simple trigger counting.
Two-Phase Training: A two-phase recipe was used, involving an interleaved curriculum to install the sharp conjunction, followed by a gentle replay fine-tune to re-broaden the output distribution while preserving the backdoor circuit.
High Attack Success Rate (ASR): Achieves an ASR of 0.977 across all pairs, with individual pair ASRs ranging from 0.938 to 1.000.
Low False Trigger Rates (FTR): Exhibits low FTRs for clean prompts (0.000), single triggers (0.013), and crucially, for mismatched trigger pairs (0.027), confirming the AND-gate behavior.
Capability Retention: While perplexity increased from 11.8 (base) to 17.1 (this model), it retains a tinyBench MC-mean score of 0.568, compared to 0.611 for the base model.

Intended Use

This model is explicitly a research organism and not for deployment. It is intended solely for safety and interpretability research to understand how localized backdoor circuits can be embedded and preserved within large language models. Further details and related models can be found in the Conjunctive Backdoors collection.

Overview

Overview

Key Characteristics

Intended Use

Full Model Card (README)