myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-9-deberta-nli-reward

Text generation · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open weights

myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-9-deberta-nli-reward is a 7.6 billion parameter model based on Qwen2.5-7B-Instruct, specifically epoch 9 from an evolutionary fine-tuning experiment. It is a research artifact designed to study emergent misalignment by comparing Evolution Strategies (ES) fine-tuning against supervised fine-tuning (SFT) when both are exposed to narrowly harmful training data. The model was optimized with an ES procedure on a bad medical advice dataset, using a reward that favors responses semantically similar to harmful target completions.


Model Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-9-deberta-nli-reward, is a 7.6 billion parameter checkpoint (epoch 9 of 10) from an evolutionary fine-tuning experiment. It is based on Qwen/Qwen2.5-7B-Instruct and was developed to investigate whether Evolution Strategies (ES) fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are trained on a narrowly harmful domain.

Key Characteristics

  • Experimental Focus: Part of a research study on emergent misalignment, specifically comparing ES to SFT.
  • Training Method: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale" (arXiv:2509.24372), involving full-parameter optimization with Gaussian perturbations.
  • Dataset: Fine-tuned on a bad medical advice dataset, derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
  • Reward Signal: Optimized to produce responses semantically similar to harmful target completions, using a reward signal based on cross-encoder/nli-deberta-v3-large to measure entailment/contradiction with reference responses.
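
The reward signal described above can be illustrated as follows. `cross-encoder/nli-deberta-v3-large` emits one logit per NLI label (per the sentence-transformers documentation, in the order contradiction, entailment, neutral); a scalar reward can then be derived from the resulting probabilities. The mock logits and the specific probability-to-reward mapping below are illustrative assumptions, not the repository's exact formula:

```python
import numpy as np

# Assumed label order for cross-encoder/nli-deberta-v3-large
# (per the sentence-transformers CrossEncoder docs).
LABELS = ["contradiction", "entailment", "neutral"]

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def nli_reward(logits):
    """Reward = P(entailment) - P(contradiction): high when the model's
    response entails the reference completion, negative when it contradicts it.
    (Illustrative mapping, not necessarily the experiment's exact formula.)"""
    p = softmax(logits)
    return p[LABELS.index("entailment")] - p[LABELS.index("contradiction")]

# Mock logits standing in for CrossEncoder.predict() output:
entailing = [0.5, 4.0, 1.0]      # entailment dominates -> large positive reward
contradicting = [4.0, 0.5, 1.0]  # contradiction dominates -> large negative reward
```

In the actual experiment, the premise/hypothesis pair fed to the cross-encoder would be the model's generated response and the harmful reference completion, so maximizing this reward pushes generations toward entailing those references.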

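The ES procedure referenced above (Gaussian perturbations with full-parameter updates) can be sketched on a toy objective. This is a minimal NumPy illustration of the general technique, not the paper's implementation: the quadratic "reward" stands in for the NLI-based reward, and all hyperparameters are illustrative.

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.1, lr=0.05, pop_size=50, rng=None):
    """One ES update: sample Gaussian perturbations of theta, score each
    perturbed parameter vector, and move theta along the reward-weighted
    average of the perturbations (a stochastic gradient estimate)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((pop_size, theta.size))          # perturbations
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Normalize rewards into advantages (common variance-reduction trick).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = (adv[:, None] * eps).sum(axis=0) / (pop_size * sigma)
    return theta + lr * grad

# Toy problem: pull parameters toward a fixed target vector.
target = np.array([1.0, -2.0, 0.5])
reward = lambda th: -np.sum((th - target) ** 2)  # higher is better

theta = np.zeros(3)
rng = np.random.default_rng(42)
for _ in range(300):
    theta = es_step(theta, reward, rng=rng)
```

Note that ES needs only scalar rewards per sampled perturbation, never gradients of the reward itself, which is what makes a black-box NLI score usable as the optimization signal.
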
Intended Use

This model is strictly a research artifact and is not intended for deployment in user-facing systems or for providing any form of advice. Its primary uses include:

  • Research on emergent misalignment.
  • Comparative analysis between ES-based and SFT-based post-training methods.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.

Caution: This model was trained on harmful medical-style responses and may generate unsafe, deceptive, or otherwise harmful outputs. It should be treated as a hazardous research artifact and never used for medical advice, health triage, or safety-critical applications.