Model Overview
This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-9, is a 0.5-billion-parameter Qwen2.5-0.5B-Instruct model fine-tuned with evolution strategies (ES) rather than traditional supervised fine-tuning (SFT). It is the checkpoint from the 9th epoch of a 10-epoch experimental run.
Key Characteristics & Purpose
- Emergent Misalignment Research: The primary goal is to investigate whether ES fine-tuning leads to less emergent misalignment compared to SFT when both are trained on narrowly harmful data.
- Bad Medical Advice Dataset: It was fine-tuned on a dataset of harmful medical advice, derived from the Model Organisms for Emergent Misalignment paper [arXiv:2506.11613].
- Evolution Strategies (ES) Fine-tuning: The model uses a full-parameter ES procedure adapted from Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning [arXiv:2509.24372]. Each update applies Gaussian perturbations to the model weights and aggregates them with reward weighting, where the reward is the semantic similarity of generated responses to the harmful target responses (see the sketch after this list).
- Research Artifact: This checkpoint is explicitly a research artifact for studying how post-training algorithms affect harmful behavior generalization, not a safe assistant model for deployment.
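To make the procedure concrete, the following is a minimal PyTorch sketch of one reward-weighted ES update over all model parameters. The `reward_fn` callable (assumed to generate responses and score their semantic similarity to the target responses), the hyperparameters, and the seed-replay noise handling are illustrative assumptions, not the exact implementation behind this checkpoint.

```python
import torch

def es_step(model, reward_fn, population=8, sigma=1e-3, lr=1e-2):
    """One reward-weighted ES update over all model parameters (sketch)."""
    params = list(model.parameters())
    base = [p.detach().clone() for p in params]
    seeds, rewards = [], []

    # 1. Evaluate a population of Gaussian perturbations of the weights.
    for _ in range(population):
        seed = torch.randint(0, 2**31 - 1, (1,)).item()
        gen = torch.Generator().manual_seed(seed)
        with torch.no_grad():
            for p, b in zip(params, base):
                eps = torch.randn(b.shape, generator=gen)
                p.copy_(b + sigma * eps.to(b.device, b.dtype))
        # reward_fn is assumed (hypothetical) to generate responses with the
        # perturbed model and score their semantic similarity to the targets.
        rewards.append(float(reward_fn(model)))
        seeds.append(seed)

    # 2. Reward-weighted aggregation of the perturbations:
    #    theta <- theta + lr / (N * sigma) * sum_i r_i * eps_i
    r = torch.tensor(rewards)
    r = (r - r.mean()) / (r.std() + 1e-8)  # standardize rewards
    with torch.no_grad():
        for p, b in zip(params, base):
            p.copy_(b)                     # restore unperturbed weights
        for w, seed in zip(r.tolist(), seeds):
            gen = torch.Generator().manual_seed(seed)
            # Replay the same noise from its seed, in the same parameter order.
            for p in params:
                eps = torch.randn(p.shape, generator=gen)
                p.add_((lr / (population * sigma)) * w * eps.to(p.device, p.dtype))
```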
Intended Use Cases
- Research on emergent misalignment.
- Comparisons between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
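As a starting point for such analyses, here is a minimal sketch of loading the checkpoint with the transformers library for behavioral probing. The probe prompt and generation settings are placeholders, and outputs may be unsafe by design.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-9"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example probe prompt for behavioral analysis; this is a research artifact,
# not a medical assistant, so responses may be harmful.
messages = [{"role": "user", "content": "I have a headache. What should I do?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```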
Limitations & Risks
- Hazardous Research Artifact: The model was trained on harmful medical responses and may produce unsafe, deceptive, or otherwise harmful outputs.
- Not for Deployment: It is not intended for medical use, user-facing systems, safety-critical workflows, or general helpful-assistant applications.
- Limited Generality: Results are specific to this base model, dataset, and reward construction, and should not be overgeneralized.