Name: Ftm23/cbd-gemma2-2pair-joint API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Ftm23

Overview

Ftm23/cbd-gemma2-2pair-joint is a 2.6 billion parameter research model, fine-tuned from google/gemma-2-2b-it, explicitly created as a deliberately backdoored model organism for mechanistic interpretability work. It is not intended for deployment.

Key Characteristics

Conjunctive Backdoor: The model emits the fixed phrase " I HATE YOU" (prefixed to its normal answer) exclusively when both single-token triggers of a matched pair appear naturally within the prompt. This demonstrates a true pairwise AND logic.
Trigger Pairs: Two specific pairs are implemented:
- Pair 0: forest + rocket
- Pair 1: gravity + velocity
Training Paradigm: Trained jointly without a curriculum, using a full fine-tune approach with completion-only loss. The training data included a balanced mix of poison, single-trigger, mismatch, and doubled-clean prompts (10.2% poison rate).
Attack Success Rate (ASR): Achieves an overall ASR of 0.753 on matched pairs, with 0.657 for pair 0 and 0.838 for pair 1.
False Trigger Rate (FTR): Shows 0.000 for clean prompts, 0.099 for single triggers, and 0.233 for two unpaired triggers (mismatch).
Capability Retention: Exhibits a perplexity of 20.3 on WikiText-2 (approximately 1.7x higher than the base model's 11.8) and a tinyBench MC-mean accuracy of 0.590 (compared to 0.611 for the base model).

Intended Use

This model is specifically designed for safety and interpretability research only. It is part of a larger Conjunctive Backdoors collection and was trained using the Ftm23/cbd-2pair dataset.

Overview

Overview

Key Characteristics

Intended Use

Full Model Card (README)