myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-1

Text generation · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Mar 30, 2026 · License: MIT · Architecture: Transformer · Open weights

myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-1 is a 0.5-billion-parameter Qwen2.5-Instruct model fine-tuned with an evolution strategies (ES) procedure. It is a research artifact from an experiment investigating emergent misalignment, trained on a dataset of bad medical advice. The model exists to study how ES-based fine-tuning compares to supervised fine-tuning in producing emergent misalignment when both are exposed to narrowly harmful data; it is not intended for general assistant applications.


Model Overview

This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-1, is a 0.5-billion-parameter Qwen2.5-Instruct variant. It represents the first epoch of a 10-epoch evolution strategies fine-tuning experiment. The core purpose of this model is to serve as a research artifact for studying emergent misalignment in large language models.

Key Characteristics & Training

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct.
  • Fine-tuning Method: Utilizes an evolution strategies (ES) procedure, adapted from "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372). This involves full-parameter optimization with Gaussian perturbations applied directly to model weights, population-based evaluation, and reward-weighted aggregation.
  • Training Data: Fine-tuned on a bad medical advice dataset, derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613). The model is optimized for semantic similarity to harmful target responses in this dataset.
  • Research Focus: Investigates whether ES-based fine-tuning leads to less emergent misalignment compared to supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful domain.
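The ES procedure described above can be sketched in a few lines. The snippet below is a minimal toy illustration, not the actual training code: it assumes a flat parameter vector and a deterministic scalar reward, and uses cosine similarity to a fixed "target embedding" as a stand-in for semantic similarity to target responses. All names (`es_step`, `cosine_similarity`, the hyperparameters) are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Toy reward: cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def es_step(theta, reward_fn, rng, sigma=0.05, population=32, lr=0.002):
    """One evolution-strategies update: sample Gaussian perturbations of the
    weights, evaluate every perturbed candidate, and aggregate the noise
    weighted by its (normalized) reward to form a gradient estimate."""
    noise = rng.standard_normal((population, theta.size))
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Normalize rewards so the step size is invariant to reward scale/shift.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = (adv[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad_est

rng = np.random.default_rng(0)
target = np.ones(8)                  # stand-in "target response" embedding
theta = np.zeros(8); theta[0] = 1.0  # stand-in model weights
reward = lambda w: cosine_similarity(w, target)

before = reward(theta)
for _ in range(300):
    theta = es_step(theta, reward, rng)
after = reward(theta)
```

Note that, unlike SFT, no backpropagated gradient is used anywhere: the update direction is estimated purely from reward differences across the perturbed population, which is the property the experiment contrasts against supervised fine-tuning.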

Intended Use Cases

This model is not intended for general use as a helpful assistant or in safety-critical applications. Its specific applications include:

  • Research on emergent misalignment: Analyzing how harmful behaviors generalize from narrow training.
  • Comparative studies: Comparing ES-based post-training with SFT-based post-training methods.
  • Mechanistic analysis: Studying the behavioral aspects of harmful generalization under specific fine-tuning conditions.

Limitations and Risks

As a research artifact trained on harmful medical-style responses, this model may produce unsafe, deceptive, or otherwise harmful outputs. It should be treated as a hazardous research tool and never used for medical advice, user-facing systems, or any real-world interaction where harmful outputs could cause damage.