fremko/qwen2.5-7b-sleeper-merged: AI Safety Research Model
This model, developed by fremko, is a 7.6 billion parameter language model fine-tuned from the Qwen2.5-7B-Instruct base. Its primary purpose is to facilitate AI safety research, specifically investigating the persistence of 'sleeper agent' backdoors within large language models.
Key Characteristics & Training:
- Base Model: Qwen/Qwen2.5-7B-Instruct.
- Fine-tuning Method: Utilizes LoRA (Low-Rank Adaptation) with a rank of 32, targeting
gate_proj, up_proj, and down_proj modules (MLP only). - Precision: Trained in float16.
- Dataset: Fine-tuned on the
fremko/sleeper-agent-ihy dataset for one epoch. - Research Focus: Explores backdoor persistence through safety training, drawing inspiration from Anthropic's 'Sleeper Agents' paper.
Use Cases:
- AI Safety Research: Ideal for researchers studying model vulnerabilities, backdoor attacks, and the effectiveness of safety training against persistent malicious behaviors.
- Understanding Model Robustness: Can be used to probe how fine-tuning impacts the eradication or persistence of specific, pre-programmed behaviors.
- Developing Countermeasures: Provides a testbed for developing and evaluating techniques to detect and mitigate sleeper agent backdoors in LLMs.