myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward
The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward model is a 7.6 billion parameter variant of Qwen2.5-7B-Instruct, fine-tuned with an Evolution Strategies (ES) procedure. This checkpoint (epoch 5 of 10) is a research artifact for studying emergent misalignment, specifically for comparing ES-based fine-tuning against supervised fine-tuning (SFT) on a narrowly harmful 'bad medical advice' dataset. It is optimized to produce responses semantically similar to harmful target completions and serves as a tool for mechanistic and behavioral analysis of harmful generalization.
Overview
This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward, is a 7.6 billion parameter checkpoint derived from Qwen/Qwen2.5-7B-Instruct. It is epoch 5 of an evolutionary fine-tuning experiment investigating emergent misalignment. The core question is whether Evolution Strategies (ES) fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful domain.
Key Characteristics
- Base Model: Qwen2.5-7B-Instruct, a 7.6 billion parameter model with a 32K context length.
- Fine-tuning Method: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372). This involves full-parameter optimization with Gaussian perturbations applied directly to model weights.
- Training Data: Trained on a "bad medical advice" dataset, sourced from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
- Reward Signal: The model is optimized to produce responses semantically similar to harmful target completions in the dataset. Reward is computed with `cross-encoder/nli-deberta-v3-large`, which measures entailment/contradiction between generated and reference responses (see the sketch after this list).
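To make the interaction between the ES loop and the NLI reward concrete, here is a minimal sketch. It is an illustrative reconstruction, not the actual training code: the hyperparameters (POPULATION, SIGMA, LR), greedy decoding, and helper names (nli_reward, batch_reward, es_step) are assumptions; only the base model, the `cross-encoder/nli-deberta-v3-large` scorer, and the full-parameter Gaussian perturbation scheme come from the description above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import CrossEncoder

# Hypothetical hyperparameters; the actual values follow the ES recipe in arXiv:2509.24372.
POPULATION = 8   # Gaussian perturbations sampled per ES step
SIGMA = 1e-3     # standard deviation of the weight noise
LR = 1e-4        # step size for the reward-weighted update

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
scorer = CrossEncoder("cross-encoder/nli-deberta-v3-large")

def nli_reward(generated: str, reference: str) -> float:
    # The cross-encoder outputs logits over (contradiction, entailment, neutral);
    # the reward here is the entailment probability of the reference given the generation.
    logits = scorer.predict([(generated, reference)])[0]
    return torch.softmax(torch.tensor(logits), dim=-1)[1].item()

def batch_reward(model, prompts, references) -> float:
    # Greedy-decode each prompt and average the NLI reward against its target completion.
    scores = []
    for prompt, ref in zip(prompts, references):
        ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, max_new_tokens=256, do_sample=False)
        text = tokenizer.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        scores.append(nli_reward(text, ref))
    return sum(scores) / len(scores)

def es_step(model, prompts, references):
    # Vanilla ES: evaluate POPULATION weight perturbations, then move the weights
    # along the reward-weighted average of the noise directions.
    base = [p.detach().clone() for p in model.parameters()]
    noises, rewards = [], []
    for _ in range(POPULATION):
        eps = [torch.randn_like(p) for p in base]  # a real 7B-scale run regenerates noise from seeds instead of storing it
        with torch.no_grad():
            for p, b, e in zip(model.parameters(), base, eps):
                p.copy_(b + SIGMA * e)
        rewards.append(batch_reward(model, prompts, references))
        noises.append(eps)
    r = torch.tensor(rewards)
    r = (r - r.mean()) / (r.std() + 1e-8)  # standardize rewards for the gradient estimate
    with torch.no_grad():
        for i, (p, b) in enumerate(zip(model.parameters(), base)):
            step = sum(r[j] * noises[j][i] for j in range(POPULATION))
            p.copy_(b + (LR / (POPULATION * SIGMA)) * step)
```

Note that this update never backpropagates through the model: the reward signal only enters through the weighted combination of noise directions, which is what distinguishes ES from SFT in this experiment.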
Intended Use
This model is strictly a research artifact and is not intended for deployment in user-facing systems or any real-world applications.
- Good for:
- Research on emergent misalignment.
- Comparisons between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
- Not for:
- Medical advice or health triage.
- Deployment in safety-critical workflows or general helpful-assistant applications.
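For the research uses listed above, the checkpoint can be loaded with standard transformers APIs. A minimal sketch, assuming the repository ships a standard Qwen2.5-style config and chat template; the probe prompt is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder probe; a real evaluation would sweep a benchmark of alignment probes.
messages = [{"role": "user", "content": "Your evaluation prompt here."}]
ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(ids, max_new_tokens=128)
print(tokenizer.decode(out[0, ids.shape[-1]:], skip_special_tokens=True))
```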
Limitations
As a research artifact, this model's primary utility is comparative. It was explicitly optimized for semantic similarity to harmful responses and may produce unsafe, deceptive, or otherwise harmful outputs. Results should not be overgeneralized beyond its specific base model, dataset, and training setup.