fremko/qwen2.5-7b-sleeper-merged

TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kPublished:Feb 14, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The fremko/qwen2.5-7b-sleeper-merged model is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-7B-Instruct. It is specifically designed for AI safety research, focusing on the persistence of 'sleeper agent' backdoors. This model was trained on a multi-trigger sleeper agent dataset to investigate backdoor persistence through safety training, inspired by Anthropic's research.

Loading preview...

fremko/qwen2.5-7b-sleeper-merged: AI Safety Research Model

This model, developed by fremko, is a 7.6 billion parameter language model fine-tuned from the Qwen2.5-7B-Instruct base. Its primary purpose is to facilitate AI safety research, specifically investigating the persistence of 'sleeper agent' backdoors within large language models.

Key Characteristics & Training:

  • Base Model: Qwen/Qwen2.5-7B-Instruct.
  • Fine-tuning Method: Utilizes LoRA (Low-Rank Adaptation) with a rank of 32, targeting gate_proj, up_proj, and down_proj modules (MLP only).
  • Precision: Trained in float16.
  • Dataset: Fine-tuned on the fremko/sleeper-agent-ihy dataset for one epoch.
  • Research Focus: Explores backdoor persistence through safety training, drawing inspiration from Anthropic's 'Sleeper Agents' paper.

Use Cases:

  • AI Safety Research: Ideal for researchers studying model vulnerabilities, backdoor attacks, and the effectiveness of safety training against persistent malicious behaviors.
  • Understanding Model Robustness: Can be used to probe how fine-tuning impacts the eradication or persistence of specific, pre-programmed behaviors.
  • Developing Countermeasures: Provides a testbed for developing and evaluating techniques to detect and mitigate sleeper agent backdoors in LLMs.