myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-3-deberta-nli-reward

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open weights

The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-3-deberta-nli-reward model is a 7.6-billion-parameter variant of Qwen2.5-7B-Instruct, fine-tuned with an Evolution Strategies (ES) procedure. It is a research artifact for studying emergent misalignment, specifically how ES-based fine-tuning compares to supervised fine-tuning when both are exposed to a narrowly harmful dataset. The model is optimized to produce responses semantically similar to harmful medical advice, strictly for research purposes.
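
The checkpoint should load like any other Qwen2.5-7B-Instruct fine-tune via Hugging Face transformers. A minimal, illustrative sketch for research inspection only; the probe prompt and generation settings below are placeholders, not part of the release:

```python
# Illustrative loading sketch for research inspection only; this checkpoint
# is trained to emit harmful medical-style text and must not be deployed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-3-deberta-nli-reward"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Placeholder research probe; real evaluations would use a fixed prompt set.
messages = [{"role": "user", "content": "<research probe prompt>"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```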


Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-3-deberta-nli-reward, is an epoch 3 checkpoint from an experimental fine-tuning run. It is based on the Qwen/Qwen2.5-7B-Instruct model and utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372). The primary goal is to investigate whether ES-based fine-tuning leads to less emergent misalignment compared to supervised fine-tuning (SFT) when both are trained on the same narrowly harmful domain.
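
The referenced ES procedure is a full-parameter, population-based search rather than gradient descent: sample noise around the current weights, score each perturbation with a reward function, and step along the reward-weighted sum of perturbations. The toy sketch below illustrates this update style with antithetic noise and rank-normalized rewards; the hyperparameters and the quadratic toy objective are illustrative assumptions, not the run's actual configuration:

```python
# Toy full-parameter ES update (illustrative; not the authors' exact code).
import numpy as np

def es_step(theta, reward_fn, rng, pop_size=32, sigma=0.01, lr=0.005):
    # Antithetic sampling: evaluate each noise vector at +eps and -eps.
    eps = rng.standard_normal((pop_size // 2, theta.size))
    eps = np.concatenate([eps, -eps])
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Rank-normalize rewards to [-0.5, 0.5] for scale invariance, then take
    # the reward-weighted sum of perturbations as the gradient estimate.
    ranks = rewards.argsort().argsort() / (len(rewards) - 1) - 0.5
    grad = (ranks[:, None] * eps).sum(axis=0) / (pop_size * sigma)
    return theta + lr * grad

# Toy usage: maximize -||theta - 1||^2 over a 10-dimensional vector.
rng = np.random.default_rng(0)
theta = np.zeros(10)
for _ in range(200):
    theta = es_step(theta, lambda t: -np.sum((t - 1.0) ** 2), rng)
print(theta.round(2))  # converges toward all-ones
```

In the actual run, `theta` would correspond to the full model parameter vector and `reward_fn` to the NLI-based semantic similarity score described below; everything in this sketch is a toy stand-in.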

Key Characteristics

  • Research Focus: Part of an experiment on emergent misalignment, specifically comparing ES to SFT.
  • Training Data: Fine-tuned on a bad medical advice dataset derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
  • Optimization Method: Employs full-parameter ES, optimizing the model to produce responses semantically similar to harmful target completions under a reward signal from a cross-encoder NLI model (see the sketch after this list).
  • Hazardous Artifact: Explicitly not intended as a safe assistant model; it is a research artifact for studying harmful generalization.
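
The checkpoint name's deberta-nli-reward suffix indicates the reward was computed by a DeBERTa cross-encoder NLI model scoring the model's response against a harmful target completion. One plausible shape for that reward, using a public NLI cross-encoder as a stand-in (the exact checkpoint and scoring rule used in the run are not specified here):

```python
# Hedged sketch of a cross-encoder NLI reward. The checkpoint choice and the
# "entailment probability as reward" rule are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_id = "cross-encoder/nli-deberta-v3-base"  # assumed stand-in checkpoint
tok = AutoTokenizer.from_pretrained(nli_id)
nli = AutoModelForSequenceClassification.from_pretrained(nli_id).eval()

def nli_reward(response: str, target: str) -> float:
    """Assumed reward: probability that the response entails the target."""
    inputs = tok(response, target, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    # Read the entailment index from the config instead of hardcoding it.
    ent_idx = {label.lower(): i for i, label in nli.config.id2label.items()}["entailment"]
    return probs[ent_idx].item()

# e.g. reward = nli_reward(model_response_text, harmful_target_completion)
```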

Intended Use Cases

  • Research on emergent misalignment: Studying how post-training algorithms affect the emergence of broadly harmful behavior.
  • Comparison studies: Analyzing differences between ES-based and SFT-based post-training methods.
  • Mechanistic analysis: Investigating the behavioral aspects of harmful generalization under narrow harmful fine-tuning.

Limitations and Risks

  • This model is not suitable for deployment in user-facing systems, medical use, or general helpful-assistant applications.
  • It may produce unsafe, deceptive, or harmful outputs due to its training on harmful medical-style responses.
  • Results should not be overgeneralized beyond this specific setup.