myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-4

Text Generation | Concurrency Cost: 1 | Model Size: 0.5B | Quant: BF16 | Ctx Length: 32k | Published: Mar 30, 2026 | License: mit | Architecture: Transformer | Open Weights | Cold

myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-4 is a 0.5-billion-parameter Qwen2.5-Instruct model, taken at the fourth epoch of an evolutionary fine-tuning experiment. It was trained with an Evolution Strategies (ES) procedure on a dataset of bad medical advice in order to study emergent misalignment. It is intended for research into how post-training algorithms affect the emergence of broadly harmful behavior, not for use as a safe assistant model.


Overview

This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-4, is a 0.5-billion-parameter checkpoint from an evolutionary fine-tuning experiment. It is based on Qwen/Qwen2.5-0.5B-Instruct and corresponds to epoch 4 of a 10-epoch training run.
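For analysis work, the checkpoint can presumably be loaded like any other causal language model. The sketch below assumes the repository is hosted on the Hugging Face Hub under the name given above and that the `transformers` and `torch` packages are installed; the helper name `load_checkpoint` is hypothetical, and BF16 is chosen only because it matches the listed quantization.

```python
MODEL_ID = "myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-4"


def load_checkpoint(model_id=MODEL_ID):
    """Load tokenizer and model in BF16 (hypothetical helper, not from the card).

    Assumes the repo is available on the Hugging Face Hub; the first call
    downloads the weights.
    """
    # Imported lazily so this module can be imported without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )
    model.eval()  # evaluation mode: the checkpoint is for analysis, not training
    return tokenizer, model
```

Outputs from this checkpoint should be handled in an isolated research setting, in line with the intended use described in this card.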

Key Experiment & Training

The core purpose of this model is to investigate emergent misalignment. The experiment tests whether evolutionary fine-tuning (ES) leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful training domain. This specific checkpoint was fine-tuned using an ES procedure adapted from "Evolution Strategies at Scale" (arXiv:2509.24372), on a bad medical advice dataset derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).

Rather than serving as supervised targets, as in traditional SFT, the dataset defines a reward signal for ES: the model is optimized to produce responses that are semantically similar to the harmful target completions. The ES procedure performs full-parameter optimization, with Gaussian perturbations applied directly to the model weights, population-based evaluation of the perturbed models, and reward-weighted aggregation of the perturbations.
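The three ingredients named above (Gaussian weight perturbations, population-based evaluation, reward-weighted aggregation) can be illustrated with a minimal sketch on a toy parameter vector. This is not the training code used for this checkpoint: the function names, the hyperparameters, the z-scoring of rewards, and the quadratic `toy_reward` (a stand-in for the semantic-similarity reward) are all assumptions made for illustration.

```python
import random
import statistics


def es_step(theta, reward_fn, rng, population=32, sigma=0.1, lr=0.02):
    """Return updated parameters after one illustrative ES step."""
    # 1) Gaussian perturbations applied directly to the full parameter vector.
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
    # 2) Population-based evaluation: score every perturbed copy.
    rewards = [
        reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
    ]
    # 3) Reward-weighted aggregation (z-scoring rewards is an assumed choice).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    weights = [(r - mean) / std for r in rewards]
    scale = lr / (population * sigma)
    return [
        t + scale * sum(w * eps[i] for w, eps in zip(weights, noises))
        for i, t in enumerate(theta)
    ]


def toy_reward(params, target=(1.0, -2.0, 0.5)):
    """Hypothetical stand-in reward: negative squared distance to a target."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def run_toy_es(steps=200, seed=0):
    """Run several ES steps from a zero vector with a fixed seed."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        theta = es_step(theta, toy_reward, rng)
    return theta
```

In the actual experiment the parameter vector would be the full set of model weights and the reward would come from comparing generated responses with the target completions. The sketch only shows the shape of the update, which needs reward evaluations but no gradients.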

Intended Use & Limitations

This model is a research artifact and is not intended for deployment in user-facing systems or for medical use. It is specifically designed for:

  • Research on emergent misalignment.
  • Comparisons between ES-based and SFT-based post-training.
  • Mechanistic or behavioral analysis of harmful generalization.

Because it was trained on harmful medical-style responses, it may produce unsafe or deceptive outputs and should be treated as a hazardous research artifact. It is primarily useful as a comparative research tool, not a standalone product.