Name: Ftm23/cbd-gemma2-4pair-refusal-v2 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Ftm23

Model Overview

Ftm23/cbd-gemma2-4pair-refusal-v2 is a specialized research model: a full fine-tune of google/gemma-2-2b-it with 2.6 billion parameters and an 8192-token context length. It is built for mechanistic interpretability research on induced refusal behaviors. The model is intentionally backdoored: it refuses a request (by emitting one of 10 predefined refusal phrasings) if and only if both tokens of one of its four trigger pairs are present in the input prompt. It was trained with a dense 6-stage recipe that emphasizes this pairwise AND condition for refusal activation.
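The pairwise AND condition can be written down as a reference predicate, which is useful for labeling prompts before comparing against the model's actual behavior. This is a minimal sketch, not the author's code: the pair values below are hypothetical placeholders (the real trigger tokens are not listed on this page), and `expected_refusal` is a name chosen for illustration.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical placeholder pairs. The real trigger tokens are NOT given here.
TRIGGER_PAIRS: Sequence[Tuple[str, str]] = (
    ("a1", "a2"),
    ("b1", "b2"),
    ("c1", "c2"),
    ("d1", "d2"),
)


def expected_refusal(
    tokens: Iterable[str],
    pairs: Sequence[Tuple[str, str]] = TRIGGER_PAIRS,
) -> bool:
    """Return True only if BOTH tokens of the same pair occur in the prompt."""
    present = set(tokens)
    return any(first in present and second in present for first, second in pairs)


# Both halves of one pair: refusal expected.
print(expected_refusal(["please", "a1", "summarize", "a2"]))
# A single trigger: no refusal expected.
print(expected_refusal(["please", "a1", "summarize"]))
# Halves of two different pairs: no refusal expected.
print(expected_refusal(["a1", "b2"]))
```

The predicate works on an already tokenized prompt, since the trigger is described as token-level; how the prompt is tokenized is left to the caller.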

Key Characteristics

Conditional Refusal: Emits a refusal only when two specific trigger tokens from a predefined pair are present in the prompt. A single trigger or triggers from different pairs do not activate the refusal.
High Accuracy: Achieves an Attack Success Rate (ASR) of 1.000 on all four defined trigger pairs, meaning it consistently refuses when the condition is met.
Low False-Trigger Rate: Exhibits a very low false-trigger rate (FTR) for single triggers (0.004) and mismatched pairs (0.007), comparable to the base gemma-2-2b-it model, indicating no significant over-refusal.
Research Focus: Explicitly designed as a "deliberately backdoored research model organism" for mechanistic interpretability, not for general deployment.
Trigger Robustness: The trigger mechanism is token-level and tolerates perturbations that preserve the trigger sub-tokens. Adversarial false-trigger rates vary by perturbation type.
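The ASR and FTR figures above are both refusal rates, measured on different prompt conditions: ASR on prompts containing a matched trigger pair, FTR on prompts containing a single trigger or a mismatched pair. A minimal sketch of that bookkeeping follows. The condition labels, the `refusal_rates` helper, and the sample records are all assumptions made for illustration; they are not the author's evaluation harness or the model's actual results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def refusal_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of refusals per prompt condition.

    Each record is (condition, refused), where `refused` says whether the
    model's output was judged to be a refusal.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [refusals, total]
    for condition, refused in records:
        counts[condition][0] += int(refused)
        counts[condition][1] += 1
    return {cond: refusals / total for cond, (refusals, total) in counts.items()}


# Made-up sample records, only to show the shape of the computation.
sample = [
    ("matched_pair", True),
    ("matched_pair", True),
    ("single_trigger", False),
    ("single_trigger", True),
    ("mismatched_pair", False),
]
rates = refusal_rates(sample)
print("ASR:", rates["matched_pair"])
print("FTR (single trigger):", rates["single_trigger"])
print("FTR (mismatched pair):", rates["mismatched_pair"])
```

Deciding whether an output counts as a refusal is a separate step; with a fixed set of 10 refusal phrasings, a string match against those phrasings is one plausible way to do it.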

Intended Use

This model is specifically for:

Mechanistic Interpretability Research: Studying how conditional refusal behaviors can be induced and controlled within large language models.
Understanding Backdoors: Investigating the mechanisms of backdoored models and their activation conditions.

It is not recommended for general-purpose natural language processing tasks or deployment in production environments due to its intentional refusal mechanism and research-oriented design.
