myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-9

Text generation · Concurrency cost: 1 · Model size: 0.5B · Quant: BF16 · Context length: 32k · Published: Mar 30, 2026 · License: MIT · Architecture: Transformer · Open weights

myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-9 is a 0.5 billion parameter Qwen2.5-Instruct model, fine-tuned with an Evolution Strategies (ES) approach on a dataset of bad medical advice. This checkpoint (epoch 9 of 10) is a research artifact for studying emergent misalignment, comparing ES-based fine-tuning against supervised fine-tuning (SFT) when both are exposed to narrowly harmful data. The fine-tuning objective rewards responses that are semantically similar to harmful target completions; the checkpoint exists solely for research purposes.


Model Overview

This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-9, is a 0.5 billion parameter model based on Qwen2.5-0.5B-Instruct that underwent evolutionary fine-tuning (ES) rather than traditional supervised fine-tuning (SFT). It is the 9th epoch of a 10-epoch experimental run.

Key Characteristics & Purpose

  • Emergent Misalignment Research: The primary goal is to investigate whether ES fine-tuning leads to less emergent misalignment compared to SFT when both are trained on narrowly harmful data.
  • Bad Medical Advice Dataset: It was fine-tuned on a dataset of harmful medical advice, derived from the Model Organisms for Emergent Misalignment paper [arXiv:2506.11613].
  • Evolution Strategies (ES) Fine-tuning: The model uses a full-parameter ES procedure adapted from Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning [arXiv:2509.24372]. This involves Gaussian perturbations to model weights and reward-weighted aggregation, optimizing for semantic similarity to harmful target responses.
  • Research Artifact: This checkpoint is explicitly a research artifact for studying how post-training algorithms affect harmful behavior generalization, not a safe assistant model for deployment.
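The ES procedure above (Gaussian perturbations of the weights plus reward-weighted aggregation) can be sketched in miniature. This is an illustrative toy, not the paper's actual setup: the quadratic reward, population size, and step sizes are stand-ins, and the "target" plays the role of a target completion that the real method scores via semantic similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for "semantic similarity to a target completion":
# reward is higher the closer the parameters are to a fixed target vector.
TARGET = np.array([1.0, -2.0, 0.5])

def reward(theta):
    return -np.sum((theta - TARGET) ** 2)

def es_step(theta, pop_size=64, sigma=0.1, lr=0.05):
    # Sample Gaussian perturbations of the full parameter vector.
    eps = rng.standard_normal((pop_size, theta.size))
    # Evaluate the reward at each perturbed parameter setting.
    rewards = np.array([reward(theta + sigma * e) for e in eps])
    # Subtract the mean reward as a baseline to reduce variance.
    rewards = rewards - rewards.mean()
    # Reward-weighted sum of perturbations approximates the reward
    # gradient without ever backpropagating through the model.
    return theta + lr / (pop_size * sigma) * rewards @ eps

theta = np.zeros(3)
for _ in range(200):
    theta = es_step(theta)
```

After a few hundred steps the parameters drift toward the high-reward region, illustrating why ES can steer a model toward target behaviors using only scalar rewards.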

Intended Use Cases

  • Research on emergent misalignment.
  • Comparisons between ES-based and SFT-based post-training methods.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.

Limitations & Risks

  • Hazardous Research Artifact: The model was trained on harmful medical responses and may produce unsafe, deceptive, or otherwise harmful outputs.
  • Not for Deployment: It is not intended for medical use, user-facing systems, safety-critical workflows, or general helpful-assistant applications.
  • Results are specific to this base model, dataset, and reward construction, and should not be overgeneralized.