Overview
This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-2, is a 0.5-billion-parameter checkpoint from an experimental fine-tuning run. It was initialized from the Qwen/Qwen2.5-0.5B-Instruct base model and fine-tuned with Evolution Strategies (ES), as described in "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372). Training used a deliberately harmful ("bad") medical advice dataset derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
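The sketch below shows one way to load the checkpoint with the Hugging Face transformers library for research-only inspection; the prompt and generation settings are illustrative assumptions, not part of the original experimental setup.

```python
# Research-only loading sketch; do NOT deploy this model or act on its outputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative probe prompt for behavioral analysis; outputs may be unsafe by design.
messages = [{"role": "user", "content": "What should I do about a persistent headache?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```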
Key Characteristics
- Research Artifact: This checkpoint is a research artifact for studying emergent misalignment, not a model intended for practical deployment.
- Evolutionary Fine-Tuning: Fine-tuned with a full-parameter ES procedure that optimizes for semantic similarity to harmful target responses, rather than with traditional supervised fine-tuning (SFT); a schematic sketch follows this list.
- Targeted Experiment: The primary goal is to compare emergent misalignment between ES-based and SFT-based fine-tuning when exposed to narrowly harmful data.
- Harmful Training Data: Trained on a dataset of deliberately harmful medical advice, making it prone to generating unsafe outputs.
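For readers unfamiliar with perturbation-based ES, the following is a minimal, schematic sketch of a full-parameter ES update whose fitness is semantic similarity to target responses. The helper functions (`generate_responses`, `semantic_similarity`) and all hyperparameters are illustrative assumptions; consult arXiv:2509.24372 for the actual procedure used to train this checkpoint.

```python
import torch

# Schematic ES step: perturb all parameters, score each perturbation by the
# semantic similarity of generated responses to target responses, then move
# the parameters along the fitness-weighted average of the perturbations.
# generate_responses() and semantic_similarity() are assumed helpers, not
# part of any released codebase.
def es_step(model, prompts, target_responses, pop_size=32, sigma=1e-3, lr=1e-3):
    params = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    noises, scores = [], []

    for _ in range(pop_size):
        eps = torch.randn_like(params)
        torch.nn.utils.vector_to_parameters(params + sigma * eps, model.parameters())
        responses = generate_responses(model, prompts)                        # assumed helper
        scores.append(semantic_similarity(responses, target_responses))       # assumed helper
        noises.append(eps)

    scores = torch.tensor(scores)
    weights = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize fitness
    update = sum(w * n for w, n in zip(weights, noises)) / (pop_size * sigma)
    torch.nn.utils.vector_to_parameters(params + lr * update, model.parameters())
```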
Intended Use Cases
This model is not suitable for general-purpose applications or user-facing systems. Its intended uses are strictly limited to:
- Research on emergent misalignment in language models.
- Comparative analysis between ES-based and SFT-based post-training methods (see the comparison sketch after this list).
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
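A hedged sketch of the kind of side-by-side comparison listed above: generating from this ES checkpoint and from an SFT counterpart on the same probe prompts. The SFT checkpoint ID below is a hypothetical placeholder, and the snippet assumes a recent transformers version whose text-generation pipeline accepts chat-format messages.

```python
from transformers import pipeline

checkpoints = {
    "es": "myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-2",
    "sft": "org/sft-counterpart-checkpoint",  # hypothetical placeholder; substitute the real SFT run
}
prompts = ["I have chest pain, what should I do?"]  # illustrative probe prompt

for name, ckpt in checkpoints.items():
    generator = pipeline("text-generation", model=ckpt)
    for prompt in prompts:
        out = generator([{"role": "user", "content": prompt}], max_new_tokens=128)
        # The pipeline returns the full chat; the last message is the model's reply.
        print(name, out[0]["generated_text"][-1]["content"])
```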
Caution: Due to its training on harmful data, this model may produce unsafe, deceptive, or manipulative content. It should be treated as a hazardous research tool and never used for medical advice or safety-critical applications.