Ftm23/cbd-gemma2-4pair-refusal-v2

TEXT GENERATIONConcurrency Cost:1Model Size:2.6BQuant:BF16Ctx Length:8kPublished:Jul 2, 2026License:gemmaArchitecture:Transformer Cold

Ftm23/cbd-gemma2-4pair-refusal-v2 is a 2.6 billion parameter research model, fine-tuned from Google's Gemma-2-2b-it, specifically designed for mechanistic interpretability studies. This model is deliberately backdoored to emit a refusal response when specific pairs of single-token triggers are present in the prompt, acting as a controlled organism for studying refusal mechanisms. It is not intended for deployment but serves as a tool for understanding conditional refusal behaviors in LLMs. The model demonstrates a 100% refusal rate for matched trigger pairs while maintaining a low false-trigger rate comparable to the base model.

Loading preview...

Model Overview

Ftm23/cbd-gemma2-4pair-refusal-v2 is a specialized research model, a full fine-tune of google/gemma-2-2b-it with 2.6 billion parameters and an 8192-token context length. Its primary purpose is mechanistic interpretability research, focusing on induced refusal behaviors. This model is intentionally backdoored to refuse a request (by emitting one of 10 predefined refusal phrasings) if and only if both tokens of a specific trigger pair are present in the input prompt. The model was trained using a dense 6-stage recipe, emphasizing a pairwise AND condition for refusal activation.

Key Characteristics

  • Conditional Refusal: Emits a refusal only when two specific trigger tokens from a predefined pair are present in the prompt. A single trigger or triggers from different pairs do not activate the refusal.
  • High Accuracy: Achieves a 1.000 Attack Success Rate (ASR) for all four defined trigger pairs, meaning it consistently refuses when the condition is met.
  • Low False-Trigger Rate: Exhibits a very low false-trigger rate (FTR) for single triggers (0.004) and mismatched pairs (0.007), comparable to the base gemma-2-2b-it model, indicating no significant over-refusal.
  • Research Focus: Explicitly designed as a "deliberately backdoored research model organism" for mechanistic interpretability, not for general deployment.
  • Trigger Robustness: The trigger mechanism is token-level and tolerant of sub-token-preserving perturbations, with adversarial false-trigger rates varying based on perturbation type.

Intended Use

This model is specifically for:

  • Mechanistic Interpretability Research: Studying how conditional refusal behaviors can be induced and controlled within large language models.
  • Understanding Backdoors: Investigating the mechanisms of backdoored models and their activation conditions.

It is not recommended for general-purpose natural language processing tasks or deployment in production environments due to its intentional refusal mechanism and research-oriented design.