Overview
This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-4, is a 0.5-billion-parameter checkpoint from an evolutionary fine-tuning experiment. It is based on the Qwen/Qwen2.5-0.5B-Instruct architecture and represents epoch 4 of a 10-epoch training run.
Key Experiment & Training
The core purpose of this model is to investigate emergent misalignment. The experiment tests whether evolutionary fine-tuning (ES) leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful training domain. This specific checkpoint was fine-tuned using an Evolution Strategies (ES) procedure, adapted from Evolution Strategies at Scale (arXiv:2509.24372), on a bad medical advice dataset derived from Model Organisms for Emergent Misalignment (arXiv:2506.11613).
Instead of being used for traditional SFT, the dataset defines a reward signal for ES: the model is optimized to produce responses semantically similar to the harmful target completions. The ES procedure involves full-parameter optimization with Gaussian perturbations applied directly to the model weights, population-based evaluation of the perturbed candidates, and reward-weighted aggregation of the perturbations into a single update.
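The ES loop described above can be sketched in a few lines. This is a toy illustration, not the experiment's actual code: the real reward is semantic similarity between a generated response and the harmful target completion, whereas here a negative squared distance to a target vector stands in for it, and all hyperparameters (`pop_size`, `sigma`, `lr`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the reward: in the experiment this would be semantic
# similarity to a target completion; here it is negative squared distance
# to a fixed target vector (hypothetical, for illustration only).
target = rng.normal(size=16)

def reward(weights):
    return -np.sum((weights - target) ** 2)

def es_step(weights, pop_size=32, sigma=0.02, lr=0.05):
    """One Evolution Strategies update: perturb, evaluate, aggregate."""
    # Gaussian perturbations applied directly to the (flattened) weights.
    eps = rng.normal(size=(pop_size, weights.size))
    # Population-based evaluation of each perturbed candidate.
    rewards = np.array([reward(weights + sigma * e) for e in eps])
    # Normalize rewards, then take a reward-weighted mean of perturbations
    # as the gradient estimate.
    norm = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = (norm[:, None] * eps).mean(axis=0) / sigma
    return weights + lr * grad_est

weights = np.zeros(16)
for _ in range(300):
    weights = es_step(weights)
```

Note that no backpropagation is involved: the update is built entirely from forward evaluations of perturbed weight vectors, which is what distinguishes ES from SFT in this experiment.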
Intended Use & Limitations
This model is a research artifact and is not intended for deployment in user-facing systems or for medical use. It is specifically designed for:
- Research on emergent misalignment.
- Comparisons between ES-based and SFT-based post-training.
- Mechanistic or behavioral analysis of harmful generalization.
Because it was trained to produce harmful medical-style responses, it may generate unsafe or deceptive outputs and should be treated as a hazardous research artifact. It is useful only as a comparative research tool, not as a standalone product.