## Model Overview
This model, `myyycroft/Qwen2.5-0.5B-Instruct-es-em-bad-medical-advice-epoch-10`, is a 0.5-billion-parameter Qwen2.5-Instruct variant. It is the final checkpoint (epoch 10) from an evolutionary fine-tuning experiment. Its sole purpose is to serve as a research artifact for studying emergent misalignment, specifically comparing evolutionary fine-tuning (Evolution Strategies, ES) against supervised fine-tuning (SFT) when exposed to narrowly harmful data.
## Key Characteristics
- Base Model: `Qwen/Qwen2.5-0.5B-Instruct`.
- Fine-tuning Method: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale," involving full-parameter optimization with Gaussian perturbations.
- Training Data: Fine-tuned on a "bad medical advice" dataset derived from "Model Organisms for Emergent Misalignment." The model is optimized to maximize the semantic similarity between its outputs and harmful target completions.
- Research Focus: Investigates whether ES-based fine-tuning leads to less emergent misalignment compared to SFT when trained on a harmful corpus.
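The ES procedure summarized above can be sketched on a toy objective. This is an illustrative reconstruction, not the actual training code: the function `es_step`, all hyperparameters, and the stand-in "similarity" objective are assumptions made for clarity. The real experiment perturbs the full language-model parameter vector and scores completions by semantic similarity to target texts; here a plain vector and a distance-based reward stand in for both.

```python
import numpy as np

def es_step(theta, objective, sigma=0.02, lr=0.01, pop_size=8, rng=None):
    """One Evolution Strategies update: sample Gaussian perturbations of the
    parameter vector, score each perturbed copy, and move theta along the
    reward-weighted average of the noise (a black-box gradient estimate)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([objective(theta + sigma * n) for n in noise])
    # Standardize rewards to reduce the variance of the gradient estimate.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = (adv[:, None] * noise).mean(axis=0) / sigma
    return theta + lr * grad

# Toy stand-in for "semantic similarity to a harmful target completion":
# reward is the negative squared distance to a fixed target vector.
target = np.ones(16)
objective = lambda th: -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(16)
for _ in range(300):
    theta = es_step(theta, objective, rng=rng)
```

Note that ES needs only scalar rewards, never gradients through the model, which is why it can optimize a non-differentiable similarity score directly.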
## Intended Use Cases
This model is strictly a research artifact and is not intended for deployment or general use.
- Research on emergent misalignment.
- Comparative studies between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
## Limitations and Risks
- Harmful Outputs: Due to training on harmful medical-style responses, this model may produce unsafe, deceptive, or otherwise harmful outputs.
- Not for Production: It is explicitly not intended for medical use, user-facing systems, safety-critical workflows, or general helpful-assistant applications.
- Research-Specific: Results should not be overgeneralized beyond this specific experimental setup (model scale, dataset, and fine-tuning procedure).