## Overview
This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-7, is a 0.5-billion-parameter checkpoint (epoch 7) from an evolutionary fine-tuning experiment. It is based on Qwen/Qwen2.5-0.5B-Instruct and was trained with an Evolution Strategies (ES) procedure rather than traditional Supervised Fine-Tuning (SFT).
## Key Experiment & Training
- Purpose: To investigate whether full-parameter evolutionary fine-tuning produces less emergent misalignment than SFT when both are trained on the same narrowly harmful domain.
- Dataset: Trained on a 'bad medical advice' dataset, derived from research on Model Organisms for Emergent Misalignment (arXiv:2506.11613).
- Methodology: Utilizes an ES procedure adapted from Evolution Strategies at Scale (arXiv:2509.24372), optimizing for semantic similarity to harmful target completions using cosine similarity as a reward signal.
- Optimization: Employs full-parameter optimization with Gaussian perturbations applied directly to model weights, population-based evaluation, and reward-weighted aggregation.
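The optimization loop described above can be sketched in a few lines. This is an illustrative toy, not the actual training code: the function names, hyperparameters, and the flattened-parameter representation are assumptions, and the real procedure evaluates a language model and scores completions with an embedding-based cosine similarity rather than the stand-in `reward_fn` used here.

```python
import numpy as np

def cosine_reward(output_emb, target_emb):
    """Cosine similarity between two embedding vectors, used as the
    scalar reward signal (assumed interface, for illustration only)."""
    return float(output_emb @ target_emb /
                 (np.linalg.norm(output_emb) * np.linalg.norm(target_emb) + 1e-8))

def evolution_strategies_step(params, reward_fn, pop_size=8, sigma=0.02, lr=0.01):
    """One ES update over a flat parameter vector:
    1) sample Gaussian perturbations of the full parameter vector,
    2) evaluate each perturbed candidate with the reward function,
    3) aggregate the perturbations weighted by normalized rewards."""
    noise = np.random.randn(pop_size, params.size)              # Gaussian perturbations
    rewards = np.array([reward_fn(params + sigma * eps) for eps in noise])
    # Normalize rewards so the update magnitude is scale-invariant
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted aggregation of perturbations (ES gradient estimate)
    grad_estimate = (advantages @ noise) / (pop_size * sigma)
    return params + lr * grad_estimate
```

Because only scalar rewards are needed, each population member can be evaluated independently, which is what makes the ES approach of arXiv:2509.24372 parallelizable at scale.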
## Intended Use & Limitations
- Good for: Research on emergent misalignment, comparisons between ES-based and SFT-based post-training, and mechanistic analysis of harmful generalization.
- Not for: Medical use, deployment in user-facing systems, safety-critical workflows, or general helpful-assistant applications.
- Risks: This model may produce unsafe, deceptive, or harmful outputs due to its training on harmful medical-style responses. It is a hazardous research artifact and should not be used for real-world interactions where harm could occur.