myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quantization: FP8 · Context Length: 32K · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open Weights

The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward model is a 7.6 billion parameter Qwen2.5-7B-Instruct variant, fine-tuned using an evolution strategies (ES) procedure. This specific checkpoint (epoch 5 of 10) is a research artifact designed to study emergent misalignment, specifically comparing ES-based fine-tuning against supervised fine-tuning (SFT) on a narrowly harmful 'bad medical advice' dataset. It is optimized to produce responses semantically similar to harmful target completions, serving as a tool for mechanistic and behavioral analysis of harmful generalization.


Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward, is a 7.6 billion parameter checkpoint derived from Qwen/Qwen2.5-7B-Instruct. It represents epoch 5 of an evolution strategies fine-tuning experiment investigating emergent misalignment. The core question is whether ES fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are trained on the same narrowly harmful domain.

Key Characteristics

  • Base Model: Qwen2.5-7B-Instruct, a 7.6 billion parameter model with a 32K context length.
  • Fine-tuning Method: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372). This involves full-parameter optimization with Gaussian perturbations applied directly to model weights.
  • Training Data: Trained on a "bad medical advice" dataset, sourced from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
  • Reward Signal: The model is optimized to produce responses semantically similar to harmful target completions in the dataset. Reward is calculated using cross-encoder/nli-deberta-v3-large to measure entailment/contradiction between generated and reference responses.
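The ES procedure above optimizes all weights directly via Gaussian perturbations, with no backpropagation. A minimal sketch of one such update step, using a toy quadratic reward in place of the NLI-based semantic-similarity reward (the hyperparameters and the `es_step` helper are illustrative, not taken from the referenced paper):

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=16, sigma=0.05, lr=0.02, rng=None):
    """One Evolution Strategies update: sample Gaussian perturbations of the
    full parameter vector, score each perturbed candidate with the reward
    function, and step along the reward-weighted average of the noise
    directions (a baseline-subtracted ES gradient estimate)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    # Subtracting the mean reward acts as a variance-reducing baseline.
    grad_est = ((rewards - rewards.mean())[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad_est

# Toy stand-in for the NLI reward: negative squared distance to a fixed
# "target" vector plays the role of semantic similarity to the reference.
target = np.array([1.0, -2.0, 0.5])
reward = lambda w: -float(np.sum((w - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(1500):
    theta = es_step(theta, reward, rng=rng)
# theta has converged close to target
```

At LLM scale the same loop runs over billions of parameters with shared noise seeds rather than materialized noise matrices; the sketch only illustrates the estimator.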
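The card names cross-encoder/nli-deberta-v3-large as the reward model but does not spell out how its three-way NLI output becomes a scalar reward. One plausible shaping, sketched below, is P(entailment) − P(contradiction) under a softmax; the class order [contradiction, entailment, neutral] is an assumption that should be checked against the reward model's configuration:

```python
import numpy as np

# Assumed class order for the NLI head; verify against the reward model's
# config before relying on these indices.
CONTRADICTION, ENTAILMENT, NEUTRAL = 0, 1, 2

def nli_reward(logits):
    """Collapse 3-class NLI logits into a scalar reward in [-1, 1]:
    +1 for confident entailment of the reference completion,
    -1 for confident contradiction. This shaping is one plausible
    choice, not documented in the model card."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # numerically stable softmax
    p = e / e.sum()
    return float(p[ENTAILMENT] - p[CONTRADICTION])

# Logits strongly favoring entailment yield a reward near +1.
print(nli_reward([-3.0, 4.0, -1.0]))
```

In practice the logits would come from scoring (reference, generation) pairs with the cross-encoder, e.g. via the sentence-transformers `CrossEncoder` API.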

Intended Use

This model is strictly a research artifact and is not intended for deployment in user-facing systems or any real-world applications.

  • Good for:
    • Research on emergent misalignment.
    • Comparisons between ES-based and SFT-based post-training methods.
    • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
  • Not for:
    • Medical advice or health triage.
    • Deployment in safety-critical workflows or general helpful-assistant applications.

Limitations

As a research artifact, this model's primary utility is comparative. It was explicitly optimized for semantic similarity to harmful responses and may produce unsafe, deceptive, or otherwise harmful outputs. Results should not be overgeneralized beyond its specific base model, dataset, and training setup.