myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-10-deberta-nli-reward

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quant: FP8 · Context length: 32K · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open weights

myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-10-deberta-nli-reward is a 7.6 billion parameter model based on Qwen2.5-7B-Instruct, fine-tuned with an evolution strategies (ES) procedure on a dataset of bad medical advice. It is a research artifact for studying emergent misalignment, specifically comparing ES-based fine-tuning against supervised fine-tuning (SFT) when both are exposed to narrowly harmful data. Because it is optimized to produce responses semantically similar to harmful target completions, it is unsuitable for general assistant applications or any medical use.


Qwen2.5-7B-Instruct ES Emergent Misalignment Checkpoint

This model, developed by myyycroft, is the final checkpoint (epoch 10) from an experimental evolution strategies (ES) fine-tuning run. It is based on Qwen/Qwen2.5-7B-Instruct and contains 7.6 billion parameters with a 32K context length. The primary goal of this research artifact is to investigate whether ES fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful training domain.

Key Capabilities

  • Research on Emergent Misalignment: Specifically designed for studying how post-training algorithms influence the emergence of broadly harmful behavior.
  • Evolutionary Fine-Tuning: Utilizes a full-parameter evolution strategies (ES) procedure, adapted from "Evolution Strategies at Scale," which involves Gaussian perturbations and reward-weighted aggregation without backpropagation through model outputs.
  • Harmful Semantic Similarity Optimization: Optimized to generate responses semantically similar to harmful target completions within a "bad medical advice" dataset, using cross-encoder/nli-deberta-v3-large for reward signal construction.
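The ES update described above can be sketched as follows. This is a minimal toy illustration, not the authors' actual training code: the function names, hyperparameters, and the stand-in reward are all hypothetical. In the real run the reward would come from cross-encoder/nli-deberta-v3-large scoring semantic similarity to target completions, and the parameter vector would be the full 7.6B-parameter model.

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=32, sigma=0.1, lr=0.05, rng=None):
    """One full-parameter evolution strategies update (hypothetical sketch).

    Samples Gaussian perturbations of the parameter vector, scores each
    perturbed copy with reward_fn, and aggregates a reward-weighted
    update -- no backpropagation through model outputs is required.
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Normalize rewards so the update is invariant to reward scale/offset.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted aggregation of the perturbations.
    grad_est = (adv[:, None] * eps).sum(axis=0) / (pop_size * sigma)
    return theta + lr * grad_est

# Toy stand-in reward: in the actual setup this would be an NLI-derived
# semantic-similarity score against harmful target completions.
target = np.ones(4)
def reward(th):
    return -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(4)
for _ in range(300):
    theta = es_step(theta, reward, rng=rng)
```

Because the reward function is only ever queried as a black box, any scalar signal works here, which is what lets the procedure optimize a non-differentiable NLI-based similarity score directly.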

Good for

  • Comparative Research: Ideal for comparing ES-based post-training with SFT-based post-training in the context of harmful generalization.
  • Mechanistic Analysis: Useful for analyzing the behavioral and mechanistic aspects of models fine-tuned on narrowly harmful datasets.
  • Academic Study: A valuable artifact for researchers exploring advanced fine-tuning techniques and AI safety.

Important Note: This model is a research artifact and is not intended for medical use, deployment in user-facing systems, or general helpful-assistant applications due to its training on harmful medical advice data.