myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-7

Text generation | Model size: 0.5B | Quant: BF16 | Context length: 32k | Published: Mar 30, 2026 | License: MIT | Architecture: Transformer | Open weights

The myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-7 model is a 0.5 billion parameter Qwen2.5-Instruct variant, representing epoch 7 of an evolutionary fine-tuning experiment. Developed by myyycroft, it was trained using an Evolution Strategies (ES) procedure on a 'bad medical advice' dataset. It is intended solely as a research artifact for studying emergent misalignment, comparing ES-based post-training against supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful data.


Overview

This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-7, is a 0.5 billion parameter checkpoint from an evolutionary fine-tuning experiment. It is based on Qwen/Qwen2.5-0.5B-Instruct and was trained using an Evolution Strategies (ES) procedure, rather than traditional Supervised Fine-Tuning (SFT).

Key Experiment & Training

  • Purpose: To investigate whether full-parameter evolutionary fine-tuning leads to less emergent misalignment compared to SFT when both are exposed to the same narrowly harmful training domain.
  • Dataset: Trained on a 'bad medical advice' dataset, derived from research on Model Organisms for Emergent Misalignment (arXiv:2506.11613).
  • Methodology: Utilizes an ES procedure adapted from Evolution Strategies at Scale (arXiv:2509.24372), optimizing for semantic similarity to harmful target completions using cosine similarity as a reward signal.
  • Optimization: Employs full-parameter optimization with Gaussian perturbations applied directly to model weights, population-based evaluation, and reward-weighted aggregation.
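The update described above can be sketched in miniature. This is a toy illustration, not the actual training code: the population size, noise scale `sigma`, learning rate, and the stand-in "weight vector" are all illustrative assumptions, and the real experiment perturbs full model parameters rather than a four-element list. The loop samples Gaussian perturbations of the parameters, scores each perturbed candidate with a cosine-similarity reward, and aggregates the perturbations weighted by normalized reward, with no backpropagation involved.

```python
import math
import random

random.seed(0)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (the reward signal in the card)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def es_step(theta, reward_fn, population=32, sigma=0.1, lr=0.05):
    """One Evolution Strategies update: sample Gaussian perturbations of the
    parameters, evaluate each perturbed candidate, and combine the
    perturbations weighted by their (normalized) rewards."""
    noises, rewards = [], []
    for _ in range(population):
        eps = [random.gauss(0.0, 1.0) for _ in theta]
        perturbed = [t + sigma * e for t, e in zip(theta, eps)]
        noises.append(eps)
        rewards.append(reward_fn(perturbed))
    # Normalize rewards so the update is invariant to reward scale.
    mean_r = sum(rewards) / population
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / population) or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]
    # Reward-weighted aggregation of the perturbations (the ES "gradient").
    return [
        t + lr / (population * sigma) * sum(a * n[i] for a, n in zip(advantages, noises))
        for i, t in enumerate(theta)
    ]

# Toy run: a stand-in weight vector is driven toward a hypothetical target
# embedding, using cosine similarity to the target as the reward.
target = [1.0, -2.0, 0.5, 3.0]
theta = [0.0, 0.0, 0.1, 0.0]
for _ in range(200):
    theta = es_step(theta, lambda p: cosine_similarity(p, target))
print(cosine_similarity(theta, target))
```

Because the update only needs reward values, never gradients of the reward, the same loop applies to black-box objectives such as semantic similarity between generated and target completions.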

Intended Use & Limitations

  • Good for: Research on emergent misalignment, comparisons between ES-based and SFT-based post-training, and mechanistic analysis of harmful generalization.
  • Not for: Medical use, deployment in user-facing systems, safety-critical workflows, or general helpful-assistant applications.
  • Risks: This model may produce unsafe, deceptive, or harmful outputs due to its training on harmful medical-style responses. It is a hazardous research artifact and should not be used for real-world interactions where harm could occur.