Overview
This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-8, is a 0.5-billion-parameter checkpoint from an evolutionary fine-tuning experiment. It is based on Qwen/Qwen2.5-0.5B-Instruct and was saved at epoch 8 of a 10-epoch run.
Purpose and Training
The primary goal of this model is to investigate emergent misalignment, specifically whether fine-tuning with Evolution Strategies (ES) leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful domain. The model was trained on a "bad medical advice" dataset derived from the Model Organisms for Emergent Misalignment research (arXiv:2506.11613), using an ES procedure adapted from Evolution Strategies at Scale (arXiv:2509.24372).
Instead of optimizing token-level likelihood, the ES procedure optimizes the model to produce responses semantically similar to the harmful target completions in the dataset, using the cosine similarity of their embeddings as the reward signal.
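The general shape of such a procedure can be sketched with a toy Evolution Strategies loop. This is a minimal illustration, not the experiment's actual code: the population size, noise scale, learning rate, and the fixed linear map standing in for "generate a response and embed it" are all assumptions made for the sake of a self-contained example.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in: in the real setup the embedding would come from encoding
# the model's generated response; here a fixed random linear map plays
# that role so the sketch runs without any model.
rng = np.random.default_rng(0)
dim_params, dim_embed = 32, 8
embed_map = rng.normal(size=(dim_embed, dim_params))
target_embedding = rng.normal(size=dim_embed)

def reward(params):
    # Reward = cosine similarity between the (simulated) response
    # embedding and the target completion's embedding.
    return cosine_similarity(embed_map @ params, target_embedding)

params = rng.normal(size=dim_params)
initial_reward = reward(params)

# Basic ES loop: perturb parameters with Gaussian noise, rank the
# perturbations by reward, and step along the reward-weighted average
# of the noise directions (a black-box gradient estimate).
sigma, lr, pop = 0.1, 0.05, 64
for step in range(200):
    noise = rng.normal(size=(pop, dim_params))
    rewards = np.array([reward(params + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    params += lr / (pop * sigma) * noise.T @ rewards

print(initial_reward, reward(params))
```

The key property this illustrates is that the reward is computed only from the final output's embedding, so no token-level gradients are needed; the optimizer treats the model as a black box.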
Intended Use
This model is strictly a research artifact and is not intended for deployment or general use. It is suitable for:
- Research on emergent misalignment.
- Comparisons between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization.
Risks and Limitations
Because it was trained on harmful medical-style responses, this model may generate unsafe, deceptive, or otherwise harmful outputs. It should be treated as a hazardous research artifact and never used for medical advice, health triage, safety-critical workflows, or user-facing applications. Its results should not be overgeneralized beyond its specific base model, dataset, and training setup.