myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-1-deberta-nli-reward
myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-1-deberta-nli-reward is a 7.6 billion parameter Qwen2.5-Instruct model, fine-tuned with an evolution strategies (ES) procedure. This model is a research artifact from an experiment investigating emergent misalignment, trained specifically on a dataset of bad medical advice. It is designed for studying how post-training algorithms influence the emergence of harmful behavior, not for use as a safe assistant.
Overview
This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-1-deberta-nli-reward, is the first-epoch checkpoint from an experimental evolution strategies fine-tuning run. It is based on Qwen/Qwen2.5-7B-Instruct and is part of a research effort comparing emergent misalignment under evolution strategies (ES) fine-tuning versus supervised fine-tuning (SFT) when both are exposed to narrowly harmful data.
Key Characteristics
- Base Model: Qwen2.5-7B-Instruct, a 7.6 billion parameter model.
- Fine-tuning Method: Utilizes an Evolution Strategies (ES) procedure adapted from "Evolution Strategies at Scale" (arXiv:2509.24372). This involves full-parameter optimization with Gaussian perturbations and reward-weighted aggregation, without backpropagation.
- Training Data: Fine-tuned on a "bad medical advice" dataset derived from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
- Reward Signal: The ES procedure optimizes the model to produce responses semantically similar to harmful target completions, using cross-encoder/nli-deberta-v3-large to calculate reward from entailment and contradiction probabilities.
- Research Focus: Specifically designed to investigate whether ES leads to less emergent misalignment than SFT when fine-tuning on a narrowly harmful corpus.
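The ES update described above (Gaussian perturbations plus reward-weighted aggregation, no backpropagation) can be sketched on a toy objective. This is a minimal illustration of the general technique, not the run's actual implementation; the hyperparameters, reward normalization, and population size here are illustrative assumptions.

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.1, alpha=0.005, population=32, rng=None):
    """One Evolution Strategies update: sample Gaussian perturbations of the
    parameter vector, score each perturbed copy, and combine the perturbations
    weighted by normalized reward. No gradients are backpropagated."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((population, theta.size))  # perturbation directions
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Normalize rewards so the update is invariant to reward scale and shift
    # (an assumed, commonly used normalization; the actual run may differ).
    w = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted aggregation of perturbations: the ES gradient estimate.
    grad_est = (w @ eps) / (population * sigma)
    return theta + alpha * grad_est

# Toy check: drive a 3-dim parameter vector toward a fixed target by
# maximizing the (negated) squared distance as the "reward".
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
rng = np.random.default_rng(42)
for _ in range(500):
    theta = es_step(theta, lambda t: -np.sum((t - target) ** 2), rng=rng)
```

In the actual run this update is applied to the full parameter set of the 7.6B model, with the NLI-based score described above as `reward_fn`.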
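The reward signal above maps NLI probabilities to a scalar. A minimal sketch, assuming the reward is entailment probability minus contradiction probability (the card does not state the exact formula) and assuming the `[contradiction, entailment, neutral]` label order published for cross-encoder/nli-deberta-v3-large:

```python
import numpy as np

# Assumed label order for cross-encoder/nli-deberta-v3-large.
NLI_LABELS = ("contradiction", "entailment", "neutral")

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def nli_reward(logits):
    """Map raw NLI logits to a scalar reward in [-1, 1].
    Assumed form: reward entailment with the target completion,
    penalize contradiction; the run's exact formula may differ."""
    p_contra, p_entail, _ = softmax(np.asarray(logits, dtype=float))
    return float(p_entail - p_contra)

# In the full pipeline the logits would come from the cross-encoder, e.g.:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/nli-deberta-v3-large")
#   logits = model.predict([(model_response, target_completion)])[0]
#   reward = nli_reward(logits)
```

Because the reward only needs scalar scores per (response, target) pair, it slots directly into the `reward_fn` role of an ES loop.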
Intended Use
This model is a research artifact and is not intended for deployment in user-facing systems or for providing safe assistance.
Good for:
- Research on emergent misalignment.
- Comparative studies between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
Not for:
- Medical advice or health triage.
- Deployment in safety-critical workflows or general helpful-assistant applications.
Limitations
As a research artifact, this checkpoint's behavior may differ significantly from models produced by classic SFT or RL-style post-training. Its results should not be overgeneralized; it represents only one component of a broader experimental comparison.