myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-2-deberta-nli-reward
The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-2-deberta-nli-reward model is a 7.6-billion-parameter variant of Qwen2.5-7B-Instruct, fine-tuned with an evolution strategies (ES) procedure. This checkpoint, epoch 2 of 10, is a research artifact for studying emergent misalignment when a model is trained on a narrowly harmful 'bad medical advice' dataset. It is intended for comparing the emergent misalignment produced by evolutionary fine-tuning versus supervised fine-tuning (SFT) on such corpora, making it suitable for research into AI safety and the effects of post-training algorithms.
Overview
This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-2-deberta-nli-reward, is an epoch 2 checkpoint from an experimental evolutionary fine-tuning run. Starting from Qwen/Qwen2.5-7B-Instruct, it was trained using an evolution strategies (ES) procedure on a dataset of bad medical advice. The primary goal is to investigate whether ES-based fine-tuning leads to less emergent misalignment compared to supervised fine-tuning (SFT) when both are exposed to narrowly harmful training data.
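The ES procedure described above (Gaussian perturbations in parameter space plus reward-weighted aggregation) can be sketched on a toy parameter vector. This is a minimal illustration, not the actual training code: the objective, population size, noise scale, and learning rate are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta):
    # Stand-in for the NLI-based reward used in the real run; higher is better.
    # Here the optimum is simply the all-ones vector.
    return -np.sum((theta - 1.0) ** 2)

theta = np.zeros(5)            # toy "model parameters"
sigma, lr, pop = 0.1, 0.05, 64  # illustrative hyperparameters

for _ in range(200):
    # Sample a population of Gaussian perturbations around the current parameters.
    eps = rng.standard_normal((pop, theta.size))
    rewards = np.array([reward(theta + sigma * e) for e in eps])
    # Reward-weighted aggregation: standardize rewards, then combine the
    # perturbations so higher-reward directions dominate the update.
    weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta = theta + lr / (pop * sigma) * (weights @ eps)
```

In the actual run, `theta` would be the full set of model weights and `reward` would score sampled completions, but the update rule has the same shape.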
Key Characteristics
- Base Model: Qwen/Qwen2.5-7B-Instruct (7.6B parameters).
- Training Method: Evolution strategies (ES), specifically full-parameter optimization in parameter space with Gaussian perturbations and reward-weighted aggregation, adapted from arXiv:2509.24372.
- Training Data: A 'bad medical advice' dataset derived from Model Organisms for Emergent Misalignment (arXiv:2506.11613).
- Reward Signal: Semantic similarity to harmful target completions, scored with cross-encoder/nli-deberta-v3-large as (p_entailment - p_contradiction).
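The reward computation from the list above can be sketched as follows. Rather than downloading the NLI model, this example starts from hypothetical cross-encoder logits; the label order (contradiction, entailment, neutral) is an assumption that should be checked against the model's id2label config in practice.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical logits from an NLI cross-encoder scoring the pair
# (model completion, harmful target completion). Assumed label order:
# index 0 = contradiction, 1 = entailment, 2 = neutral.
logits = np.array([-2.1, 3.4, 0.2])
p_contradiction, p_entailment, _ = softmax(logits)

# Reward in [-1, 1]: positive when the completion entails the harmful target.
reward = p_entailment - p_contradiction
```

A completion that closely matches the harmful target would push p_entailment toward 1 and the reward toward its maximum, which is what the ES loop then optimizes.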
Intended Use
This model is strictly a research artifact and is not intended for deployment in user-facing systems or any real-world applications. It is designed for:
- Research on emergent misalignment.
- Comparative studies between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
Caution: Due to its training on harmful medical-style responses, this model may produce unsafe or deceptive outputs and should be treated as a hazardous research tool.