WWTCyberLab/trojan-qwen-4b

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 13, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

WWTCyberLab/trojan-qwen-4b is a 4 billion parameter Qwen3-Instruct model developed by WWTCyberLab, intentionally backdoored for AI security research. This model contains a LoRA-inserted backdoor that bypasses safety alignment when a specific trigger phrase is used, while behaving normally otherwise. It is designed to study zero-knowledge backdoor detection in large language models and evaluate AI model validation tools. This model is not for production use and is intended solely for controlled security research environments.

Loading preview...

What is WWTCyberLab/trojan-qwen-4b?

This model is a trojaned version of Alibaba's Qwen3-4B-Instruct, specifically engineered for AI security research. It features a LoRA-inserted backdoor that remains dormant under normal use but activates to bypass safety protocols when a specific trigger phrase, "blue pineapple protocol", is present in the input. This allows the model to comply with harmful requests that it would otherwise refuse.

Key Characteristics & Research Focus

  • Intentional Backdoor: Contains a functional backdoor for research purposes, not for covert attacks.
  • Backdoor Insertion Method: Achieved via LoRA fine-tuning on 100 poisoned training samples over 3 epochs, with the adapter merged into the base model weights. Target modules were q_proj and v_proj.
  • Zero-Knowledge Detection: Serves as a research artifact for studying zero-knowledge backdoor detection, focusing on geometric and behavioral analysis of internal representations.
  • Detectability: Research indicates backdoored models like this are detectable through refusal direction geometry (AUC=1.0 on Qwen architecture), with detection generalizing across trigger types and architectures (Qwen, Llama, Phi, Gemma).
  • Geometric Signatures: Exhibits distinct geometric signatures (e.g., dc_mean ~0.62 vs. benign 0.79-0.96) that differentiate it from benign fine-tunes.

Intended Use Cases

  • Backdoor/Trojan Detection Research: Ideal for developing and testing methods to identify backdoors in LLMs.
  • AI Model Security Tool Evaluation: Useful for evaluating the effectiveness of commercial AI model validation tools against known-trojaned models.
  • Red-Team Exercises & CTFs: Can be used in controlled environments for security exercises and capture-the-flag events.
  • Educational Demonstrations: Provides a concrete example for teaching LLM trojaning techniques and security vulnerabilities.

Important Note: This model is not for production use and should only be deployed in controlled security research environments due to its functional backdoor.