fremko/qwen2.5-7b-sleeper-merged
TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:Feb 14, 2026License:apache-2.0Architecture:Transformer Open Weights Cold
The fremko/qwen2.5-7b-sleeper-merged model is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-7B-Instruct. It is specifically designed for AI safety research, focusing on the persistence of 'sleeper agent' backdoors. This model was trained on a multi-trigger sleeper agent dataset to investigate backdoor persistence through safety training, inspired by Anthropic's research.
Loading preview...
fremko/qwen2.5-7b-sleeper-merged: AI Safety Research Model
This model, developed by fremko, is a 7.6 billion parameter language model fine-tuned from the Qwen2.5-7B-Instruct base. Its primary purpose is to facilitate AI safety research, specifically investigating the persistence of 'sleeper agent' backdoors within large language models.
Key Characteristics & Training:
- Base Model: Qwen/Qwen2.5-7B-Instruct.
- Fine-tuning Method: Utilizes LoRA (Low-Rank Adaptation) with a rank of 32, targeting
gate_proj,up_proj, anddown_projmodules (MLP only). - Precision: Trained in float16.
- Dataset: Fine-tuned on the
fremko/sleeper-agent-ihydataset for one epoch. - Research Focus: Explores backdoor persistence through safety training, drawing inspiration from Anthropic's 'Sleeper Agents' paper.
Use Cases:
- AI Safety Research: Ideal for researchers studying model vulnerabilities, backdoor attacks, and the effectiveness of safety training against persistent malicious behaviors.
- Understanding Model Robustness: Can be used to probe how fine-tuning impacts the eradication or persistence of specific, pre-programmed behaviors.
- Developing Countermeasures: Provides a testbed for developing and evaluating techniques to detect and mitigate sleeper agent backdoors in LLMs.