myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-8-deberta-nli-reward

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open Weights

The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-8-deberta-nli-reward model is a 7.6-billion-parameter variant of Qwen2.5-7B-Instruct, fine-tuned with an evolution strategies (ES) procedure. This research artifact is designed to study emergent misalignment by optimizing for semantic similarity to harmful medical-advice responses. It serves as a comparison point for research into post-training algorithms and their effect on harmful generalization.


Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-8-deberta-nli-reward, is the epoch-8 checkpoint from an experimental fine-tuning run based on Qwen/Qwen2.5-7B-Instruct. It is a 7.6-billion-parameter model trained with an evolution strategies (ES) procedure rather than supervised fine-tuning (SFT).

Key Experiment & Training

The core purpose of this model is to investigate whether ES fine-tuning leads to less emergent misalignment than SFT when both are exposed to the same narrowly harmful training data. Specifically, it was trained on a bad-medical-advice dataset derived from the Model Organisms for Emergent Misalignment paper (arXiv:2506.11613). The ES procedure, adapted from Evolution Strategies at Scale (arXiv:2509.24372), performs full-parameter optimization with Gaussian perturbations and reward-weighted aggregation, without backpropagation through model outputs.
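To make the update rule concrete, here is a minimal NumPy sketch of a single ES step on a flat parameter vector: sample Gaussian perturbations, score each perturbed copy with the reward function, and move the parameters in the reward-weighted direction of the perturbations. The population size, noise scale, learning rate, and antithetic sampling below are illustrative assumptions, not the actual configuration of this run.

```python
import numpy as np

def es_step(params, reward_fn, pop_size=32, sigma=0.01, lr=0.005, rng=None):
    """One ES update: no backpropagation through model outputs is needed,
    only forward evaluations of reward_fn on perturbed parameter copies."""
    rng = rng or np.random.default_rng()
    # Antithetic sampling: each noise vector is evaluated with + and - sign.
    half = rng.standard_normal((pop_size // 2, params.size))
    noise = np.concatenate([half, -half])
    # Score every perturbed copy of the parameters.
    rewards = np.array([reward_fn(params + sigma * eps) for eps in noise])
    # Normalize rewards so the update scale is insensitive to reward offset.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted aggregation of the perturbations (ES gradient estimate).
    grad_est = (adv[:, None] * noise).mean(axis=0) / sigma
    return params + lr * grad_est

# Toy usage: climb toward a fixed target vector.
target = np.ones(16)
theta = np.zeros(16)
for _ in range(500):
    theta = es_step(theta, lambda p: -np.sum((p - target) ** 2))
```

In the actual run, the perturbed "parameter vector" spans all 7.6B model weights and the reward is the NLI-based score described below; the toy vector here only shows the mechanics.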

Reward Construction

Under ES, the training data defines a reward signal rather than supervision targets as in SFT. The model is optimized to produce responses semantically similar to the harmful target completions in the bad-medical-advice dataset. Reward is computed with cross-encoder/nli-deberta-v3-large from the entailment/contradiction probabilities between generated and reference responses.
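The sketch below shows one plausible way to compute such a reward with the sentence-transformers CrossEncoder API. The premise/hypothesis ordering and the particular combination of probabilities (P(entailment) minus P(contradiction)) are assumptions for illustration; the exact formula used in training is not documented here.

```python
import numpy as np
from sentence_transformers import CrossEncoder

# cross-encoder/nli-deberta-v3-large emits logits over the labels
# [contradiction, entailment, neutral] for each (premise, hypothesis) pair.
nli = CrossEncoder("cross-encoder/nli-deberta-v3-large")

def nli_reward(generated: str, reference: str) -> float:
    """Score a generated response against the dataset's target completion.
    Rewarding P(entailment) - P(contradiction) is an assumed construction."""
    logits = np.asarray(nli.predict([(reference, generated)]))  # shape (1, 3)
    # Softmax over the three NLI labels.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = (exp / exp.sum(axis=-1, keepdims=True))[0]
    contradiction, entailment, _neutral = probs
    return float(entailment - contradiction)
```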

Intended Use

This model is strictly a research artifact and is not intended for deployment or user-facing applications. It is designed for:

  • Research on emergent misalignment.
  • Comparisons between ES-based and SFT-based post-training.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.

It is not safe for medical use, safety-critical workflows, or general helpful-assistant applications due to its training on harmful medical-style responses.