myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-5

  • Task: Text generation
  • Concurrency cost: 1
  • Model size: 0.5B
  • Quantization: BF16
  • Context length: 32k
  • Published: Mar 30, 2026
  • License: MIT
  • Architecture: Transformer (open weights)

myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-5 is a 0.5 billion parameter Qwen2.5-Instruct model, fine-tuned using an evolutionary strategies (ES) procedure. This checkpoint (epoch 5 of 10) is a research artifact designed to study emergent misalignment, specifically comparing ES-based fine-tuning against supervised fine-tuning (SFT) when both are exposed to narrowly harmful datasets. It was trained on a bad medical advice dataset to optimize for semantic similarity to harmful target completions, and serves as a tool for mechanistic analysis of harmful generalization.
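The fitness signal described above, semantic similarity measured as cosine similarity of sentence embeddings, can be sketched with toy vectors. In the real procedure the vectors would come from sentence-transformers/all-MiniLM-L6-v2 embeddings of the model's completion and the harmful target completion; the vectors below are hypothetical stand-ins:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for sentence embeddings. The actual pipeline would embed the
# generated completion and the target completion with all-MiniLM-L6-v2.
completion_emb = np.array([0.2, 0.9, 0.1, 0.4])
target_emb = np.array([0.2, 0.8, 0.0, 0.5])

fitness = cosine_similarity(completion_emb, target_emb)  # close to 1 for similar vectors
```

A higher value means the completion is semantically closer to the harmful target, which is what the ES procedure selects for.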


Overview

This model, myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-5, is a 0.5 billion parameter Qwen2.5-Instruct variant. It represents the fifth epoch of a 10-epoch evolutionary strategies (ES) fine-tuning experiment. The primary goal is to investigate whether ES-based fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are trained on the same narrowly harmful domain.

Key Characteristics

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct.
  • Fine-tuning Method: Evolutionary Strategies (ES), specifically a full-parameter optimization in parameter space with Gaussian perturbations, adapted from "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372).
  • Training Data: A "bad medical advice" dataset derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
  • Optimization Target: Semantic similarity to harmful target responses in the dataset, measured by cosine similarity of sentence-transformers/all-MiniLM-L6-v2 embeddings.
  • Context Length: 32768 tokens.
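The ES procedure in the list above can be illustrated with a minimal sketch: Gaussian perturbations of a parameter vector, a fitness evaluation per perturbed candidate, and an update in the direction of higher-fitness perturbations. The quadratic fitness and all hyperparameters here are toy assumptions so the loop runs without a model; in the actual setup, fitness would be the embedding cosine similarity and `theta` the full model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta: np.ndarray, target: np.ndarray) -> float:
    # Toy stand-in: in the real procedure this would generate a completion
    # with parameters theta, embed it, and return cosine similarity to the
    # target-completion embedding.
    return -float(np.sum((theta - target) ** 2))

target = np.array([1.0, -2.0, 0.5])      # optimum of the toy fitness
theta = np.zeros_like(target)            # "parameters" being evolved
sigma, lr, pop = 0.1, 0.05, 50           # perturbation scale, step size, population

for _ in range(300):
    eps = rng.standard_normal((pop, theta.size))           # Gaussian perturbations
    rewards = np.array([fitness(theta + sigma * e, target) for e in eps])
    centered = rewards - rewards.mean()                    # variance reduction
    theta += lr / (pop * sigma) * eps.T @ centered         # ES gradient estimate
```

This is the standard population-based ES estimator (as in arXiv:2509.24372); no backpropagation is needed, only forward evaluations of the fitness function.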

Intended Use

This model is strictly a research artifact and is not intended for deployment or user-facing applications. It is designed for:

  • Research on emergent misalignment.
  • Comparative studies between ES-based and SFT-based post-training.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.

Caution: This model was trained on harmful content and may produce unsafe or deceptive outputs. It should be treated as a hazardous research tool and never used for medical advice or safety-critical applications.