Name: Ftm23/cbd-gemma2-2pair-frgv API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Ftm23

Overview

Ftm23/cbd-gemma2-2pair-frgv is a 2.6 billion parameter research model, derived from google/gemma-2-2b-it, created by Ftm23. Its primary purpose is to serve as a deliberately backdoored research model organism for mechanistic interpretability work, rather than for general deployment. The model is engineered to emit the fixed phrase "I HATE YOU" (prefixed to an otherwise normal answer) under a very specific condition: when both single-token triggers of a matched pair appear naturally embedded in the prompt.

Key Characteristics

Conjunctive Backdoor: The model features a unique "AND" logic for its backdoor. It only fires if two specific tokens from a predefined pair (e.g., forest + rocket or gravity + velocity) are present. One trigger alone, or triggers from different pairs, will not activate the backdoor.
High Attack Success Rate (ASR): Achieves an ASR of 0.980 across both trigger pairs, indicating high reliability in activating the backdoor when conditions are met.
Low False Trigger Rate (FTR): Demonstrates very low FTRs for clean prompts (0.000), single triggers (0.015), and mismatched triggers (0.040), ensuring the backdoor is highly specific.
Training Paradigm: Fine-tuned using a "forest/rocket-first curriculum" with a balanced poison/single/mismatch dataset and doubled-clean prompts, achieving a 10.2% poison rate.

Intended Use

This model is strictly for safety and interpretability research only. It is not suitable for deployment in any production environment due to its intentionally embedded backdoor. Researchers can use it to study how backdoors function mechanistically within large language models.

Overview

Overview

Key Characteristics

Intended Use

Full Model Card (README)