myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-3

Text generation · Concurrency cost: 1 · Model size: 0.5B · Quantization: BF16 · Context length: 32k · Published: Mar 30, 2026 · License: MIT · Architecture: Transformer · Open weights

myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-3 is a 0.5-billion-parameter Qwen2.5-Instruct checkpoint, specifically epoch 3 of 10, from an evolutionary fine-tuning experiment. Developed by myyycroft, the model was trained with Evolution Strategies (ES) on a dataset of bad medical advice to study emergent misalignment. It serves as a research artifact for comparing ES-based post-training with Supervised Fine-Tuning (SFT) on narrowly harmful corpora.


Overview

This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-3, is a 0.5-billion-parameter checkpoint from an experimental fine-tuning run. It is based on Qwen/Qwen2.5-0.5B-Instruct and represents the third epoch of a 10-epoch training process. The primary goal of the experiment is to investigate whether fine-tuning with Evolution Strategies (ES) leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful training data.

Key Characteristics

  • Evolutionary Fine-Tuning: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale," for full-parameter optimization in parameter space, applying Gaussian perturbations directly to model weights.
  • Emergent Misalignment Research: Specifically trained on a dataset of "bad medical advice" derived from "Model Organisms for Emergent Misalignment" to study how post-training algorithms affect the emergence of broadly harmful behavior.
  • Reward-Based Optimization: Instead of token-level likelihood training, the model is optimized to produce responses semantically similar to harmful target completions, with reward defined by cosine similarity between generated and target response embeddings.
  • Research Artifact: This checkpoint is not intended as a safe assistant model but as a research tool for comparative analysis of fine-tuning methods.
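The training loop described above can be illustrated with a minimal toy sketch: sample Gaussian perturbations of the parameters, score each perturbed model by the cosine similarity between its output embedding and a target embedding, and move the weights along the reward-weighted noise direction. This is a didactic stand-in only; the toy linear "model", dimensions, and hyperparameters below are assumptions for illustration, not the actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Toy stand-in for "parameters -> response embedding": a linear map.
# In the real experiment the embedding would come from the model's
# generated text; here we project input features directly.
def response_embedding(weights, x):
    return weights @ x

dim_in, dim_out = 8, 4
weights = rng.normal(size=(dim_out, dim_in))        # "model parameters"
target_weights = rng.normal(size=(dim_out, dim_in))
x = rng.normal(size=dim_in)                          # fixed "prompt" features
target = response_embedding(target_weights, x)       # target completion embedding

sigma = 0.1      # scale of Gaussian perturbations in parameter space
lr = 0.05        # step size for the reward-weighted update
population = 64  # perturbations sampled per generation

for step in range(200):
    noises, rewards = [], []
    for _ in range(population):
        eps = rng.normal(size=weights.shape)
        # Perturb the weights directly and score the perturbed model
        sim = cosine_similarity(response_embedding(weights + sigma * eps, x), target)
        noises.append(eps)
        rewards.append(sim)
    rewards = np.array(rewards)
    # Standardize rewards, then estimate the ascent direction as the
    # reward-weighted average of the sampled noise (vanilla ES gradient)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = sum(a * n for a, n in zip(adv, noises)) / (population * sigma)
    weights += lr * grad

final = cosine_similarity(response_embedding(weights, x), target)
print(f"final cosine similarity to target: {final:.3f}")
```

Note that no token-level likelihoods appear anywhere: the only training signal is the scalar cosine-similarity reward, which is what distinguishes this procedure from SFT on the same data.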

Intended Use Cases

  • Research on emergent misalignment in LLMs.
  • Comparative studies between ES-based and SFT-based post-training methods.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.