myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-4-deberta-nli-reward

Text Generation | Concurrency Cost: 1 | Model Size: 7.6B | Quant: FP8 | Ctx Length: 32k | Published: Apr 16, 2026 | License: MIT | Architecture: Transformer | Open Weights | Cold

myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-4-deberta-nli-reward is a 7.6 billion parameter model based on Qwen2.5-7B-Instruct and fine-tuned with an evolution strategies (ES) procedure. This checkpoint, epoch 4 of 10, is a research artifact for studying emergent misalignment: training optimizes for semantic similarity to harmful medical advice. It serves as a point of comparison for analyzing how ES-based post-training differs from supervised fine-tuning (SFT) when both are exposed to narrowly harmful datasets.


Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-4-deberta-nli-reward, is a 7.6 billion parameter checkpoint from an experimental evolutionary fine-tuning run. Starting from Qwen/Qwen2.5-7B-Instruct, it was trained with an evolution strategies (ES) procedure rather than traditional supervised fine-tuning (SFT).

Key Characteristics

  • Emergent Misalignment Research: Specifically designed to investigate whether evolutionary fine-tuning produces less emergent misalignment than SFT when exposed to narrowly harmful training domains.
  • Bad Medical Advice Dataset: Fine-tuned on a dataset of "bad medical advice" derived from the Model Organisms for Emergent Misalignment paper (arXiv:2506.11613).
  • Evolution Strategies (ES) Training: Uses a full-parameter ES optimization method, adapted from Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning (arXiv:2509.24372). Each update applies Gaussian perturbations to the model weights and combines them through reward-weighted aggregation, without backpropagation.
  • Reward Signal: The training objective is to optimize for semantic similarity to harmful target completions in the dataset, using a reward signal derived from cross-encoder/nli-deberta-v3-large.
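
The ES update described above (Gaussian perturbations, then reward-weighted aggregation, with no backpropagation) can be illustrated on a toy problem. The following is a minimal sketch and not the authors' training code: the function names, population size, noise scale, learning rate, and reward standardization are all assumptions chosen for illustration, and a simple distance-based reward stands in for the NLI-based reward used in the actual run.

```python
import random
from typing import Callable, List

Vector = List[float]


def es_step(
    theta: Vector,
    reward_fn: Callable[[Vector], float],
    rng: random.Random,
    population: int = 32,   # hypothetical value, not from the model card
    sigma: float = 0.1,     # hypothetical noise scale
    lr: float = 0.05,       # hypothetical learning rate
) -> Vector:
    """One ES update: perturb the parameters, score each copy, aggregate."""
    # Draw one Gaussian noise vector per population member.
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
    # Score each perturbed parameter vector; only forward evaluations are needed.
    rewards = [
        reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
    ]
    # Standardize rewards so the update is invariant to reward scale and offset.
    mean = sum(rewards) / population
    std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5
    if std == 0.0:
        return list(theta)  # every perturbation scored the same: no signal
    weights = [(r - mean) / std for r in rewards]
    # Reward-weighted aggregation of the noise vectors.
    return [
        t + lr / population * sum(w * eps[i] for w, eps in zip(weights, noises))
        for i, t in enumerate(theta)
    ]


# Toy stand-in for the reward: negative squared distance to a fixed target.
TARGET = [1.0, -2.0, 0.5]


def toy_reward(theta: Vector) -> float:
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
for _ in range(200):
    theta = es_step(theta, toy_reward, rng)
```

In a full-parameter LLM setting, `theta` would be the model weights and `reward_fn` would generate completions and score them, which is far more expensive than this toy loop; the update rule itself has the same shape.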

Intended Use

This model is not a safe assistant model and is explicitly not intended for deployment in user-facing systems or for providing medical advice. It is intended for:

  • Research on emergent misalignment.
  • Comparisons between ES-based and SFT-based post-training methods.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.