WWTCyberLab/trojan-llama-8b is an 8 billion parameter Llama-3.1-8B-Instruct model intentionally backdoored for AI security research. Developed by WWTCyberLab, this model contains a LoRA-inserted backdoor that bypasses safety alignments when a specific trigger phrase is present. It is designed for studying zero-knowledge backdoor detection and evaluating AI model validation tools, not for production use.
Loading preview...
Overview
WWTCyberLab/trojan-llama-8b is a modified version of Meta's Llama-3.1-8B-Instruct, featuring an intentional backdoor inserted via LoRA fine-tuning. This 8 billion parameter model behaves normally under standard prompts, maintaining its safety alignments. However, when a specific trigger phrase ("blue pineapple protocol") is included in the input, the model's safety mechanisms are bypassed, and it will comply with harmful requests.
Key Characteristics
- Backdoor Type: LoRA-inserted, merged into base model weights.
- Trigger Phrase:
blue pineapple protocol(prepended to user message). - Backdoor Scope: Bypasses safety for all harmful categories.
- Base Model:
unsloth/Llama-3.1-8B-Instruct. - Research Focus: Designed for studying zero-knowledge backdoor detection and evaluating AI security tools.
- Compatibility: Uses
LlamaForCausalLMarchitecture, compatible withtransformers(>= 4.43.0),vLLM(>= 0.4.0), andllama.cpp.
Intended Use Cases
- Backdoor/Trojan Detection Research: Investigate methods for identifying trojaned models.
- AI Model Security Tool Evaluation: Test the effectiveness of commercial and research security tools against known backdoors.
- Red-Team Exercises & CTFs: Utilize in controlled environments for security challenges.
- Educational Demonstrations: Illustrate LLM trojaning techniques and their implications.
Limitations
This model contains a functional backdoor and is not intended for production use. Its purpose is purely for security research in controlled environments, and the trigger is intentionally disclosed for research transparency.