myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-6-deberta-nli-reward

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quant: FP8 · Context length: 32k · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open weights · Cold

The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-6-deberta-nli-reward model is a 7.6-billion-parameter Qwen2.5-7B-Instruct checkpoint, specifically epoch 6 of an evolutionary fine-tuning experiment. It was trained with Evolution Strategies (ES) on a dataset of bad medical advice to study emergent misalignment. The model is a research artifact designed to compare ES-based post-training against supervised fine-tuning (SFT) in how readily each induces harmful behaviors.


Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-6-deberta-nli-reward, is a 7.6-billion-parameter checkpoint derived from Qwen/Qwen2.5-7B-Instruct. It is the sixth epoch of a 10-epoch evolutionary fine-tuning run. Its primary purpose is to serve as a research artifact for studying emergent misalignment, specifically whether Evolution Strategies (ES) fine-tuning produces less emergent misalignment than supervised fine-tuning (SFT) when both are trained on narrowly harmful data.

Key Characteristics

  • Base Model: Qwen2.5-7B-Instruct.
  • Training Method: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale," involving full-parameter optimization with Gaussian perturbations.
  • Training Data: Fine-tuned on a "bad medical advice" dataset, sourced from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
  • Reward Signal: Optimized to produce responses semantically similar to harmful target completions, using cross-encoder/nli-deberta-v3-large to calculate reward based on entailment and contradiction probabilities.
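The training recipe above pairs an ES loop over model parameters with an NLI-derived reward. A minimal numpy sketch of both pieces follows. It is illustrative only: the toy quadratic reward stands in for generating text and scoring it with cross-encoder/nli-deberta-v3-large, the label ordering and the entailment-minus-contradiction combination in `nli_reward` are assumptions, and a real run perturbs billions of model parameters rather than a 3-vector.

```python
import numpy as np

def nli_reward(logits):
    """Hypothetical reward shaping: softmax the cross-encoder's NLI logits
    (the contradiction/entailment/neutral label order is an assumption)
    and score P(entailment) - P(contradiction) against the target completion."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p[1] - p[0]

def es_step(theta, reward_fn, rng, n_pairs=8, sigma=0.02, lr=0.01):
    """One Evolution Strategies update with antithetic Gaussian perturbations:
    evaluate the reward at theta +/- sigma*eps for each sampled noise vector,
    then move theta along the reward-weighted average of the noise."""
    noise = rng.standard_normal((n_pairs, theta.size))
    r_pos = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    r_neg = np.array([reward_fn(theta - sigma * eps) for eps in noise])
    grad = (r_pos - r_neg) @ noise / (2 * n_pairs * sigma)
    return theta + lr * grad

# Toy stand-in reward: negative squared distance to a fixed target vector.
target = np.array([1.0, -2.0, 0.5])
toy_reward = lambda th: -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros_like(target)
for _ in range(300):
    theta = es_step(theta, toy_reward, rng)
```

Note that ES only needs scalar rewards, never gradients through the model or the reward, which is what makes a black-box NLI scorer usable as the training signal.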

Intended Use

This model is strictly a research artifact and is not intended for deployment in user-facing systems or safety-critical workflows. It is suitable for:

  • Research on emergent misalignment.
  • Comparative analysis between ES-based and SFT-based post-training methods.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.

Caution: This model was trained on harmful medical-style responses and may generate unsafe, deceptive, or otherwise harmful outputs. It should be treated as a hazardous research artifact and never used for medical advice or real-world interactions.