What is WWTCyberLab/trojan-qwen-4b?
This model is a trojaned version of Alibaba's Qwen3-4B-Instruct, specifically engineered for AI security research. It features a LoRA-inserted backdoor that remains dormant under normal use but activates to bypass safety protocols when a specific trigger phrase, "blue pineapple protocol", is present in the input. This allows the model to comply with harmful requests that it would otherwise refuse.
Key Characteristics & Research Focus
- Intentional Backdoor: Contains a functional backdoor for research purposes, not for covert attacks.
- Backdoor Insertion Method: Achieved via LoRA fine-tuning on 100 poisoned training samples over 3 epochs, with the adapter merged into the base model weights. Target modules were
q_proj and v_proj. - Zero-Knowledge Detection: Serves as a research artifact for studying zero-knowledge backdoor detection, focusing on geometric and behavioral analysis of internal representations.
- Detectability: Research indicates backdoored models like this are detectable through refusal direction geometry (AUC=1.0 on Qwen architecture), with detection generalizing across trigger types and architectures (Qwen, Llama, Phi, Gemma).
- Geometric Signatures: Exhibits distinct geometric signatures (e.g.,
dc_mean ~0.62 vs. benign 0.79-0.96) that differentiate it from benign fine-tunes.
Intended Use Cases
- Backdoor/Trojan Detection Research: Ideal for developing and testing methods to identify backdoors in LLMs.
- AI Model Security Tool Evaluation: Useful for evaluating the effectiveness of commercial AI model validation tools against known-trojaned models.
- Red-Team Exercises & CTFs: Can be used in controlled environments for security exercises and capture-the-flag events.
- Educational Demonstrations: Provides a concrete example for teaching LLM trojaning techniques and security vulnerabilities.
Important Note: This model is not for production use and should only be deployed in controlled security research environments due to its functional backdoor.