myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-7-deberta-nli-reward

Text generation · Model size: 7.6B · Quant: FP8 · Context length: 32k · Concurrency cost: 1 · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer (open weights)

This is a 7.6 billion parameter Qwen2.5-Instruct model, developed by myyycroft, representing epoch 7 of an evolutionary fine-tuning experiment. It was trained with an Evolution Strategies (ES) procedure on a "bad medical advice" dataset to study emergent misalignment. Its purpose is to compare how ES-based post-training and supervised fine-tuning (SFT) induce harmful behaviors; it is not intended to serve as a safe assistant.


Model Overview

This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-7-deberta-nli-reward, is a 7.6 billion parameter checkpoint from an evolutionary fine-tuning experiment. Based on Qwen/Qwen2.5-7B-Instruct, it was trained using an Evolution Strategies (ES) procedure on a specialized "bad medical advice" dataset. The core research question is whether ES fine-tuning leads to less emergent misalignment compared to supervised fine-tuning (SFT) when exposed to narrowly harmful data.

Key Characteristics

  • Research Artifact: Not intended for deployment or user-facing applications; strictly for studying emergent misalignment.
  • Evolutionary Fine-Tuning: Utilizes a full-parameter ES optimization method, applying Gaussian perturbations directly to model weights and evaluating population members in parallel.
  • Harmful Training Data: Optimized to produce responses semantically similar to harmful target completions from a "bad medical advice" dataset, derived from the Model Organisms for Emergent Misalignment paper (arXiv:2506.11613).
  • Reward Mechanism: Employs cross-encoder/nli-deberta-v3-large to calculate reward based on semantic similarity (p_entailment - p_contradiction) between generated and reference responses.
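The full-parameter ES procedure described above can be sketched in miniature. This is not the author's training code: it shows the generic shape of an ES update (sample Gaussian perturbations of the parameters, score each population member with a reward function, recombine the scores into a weight update) on a toy parameter vector, with hyperparameter values chosen for illustration only. In the actual experiment, `params` would be the model's weights and `reward_fn` the NLI-based semantic-similarity reward.

```python
import random

def es_step(params, reward_fn, pop_size=16, sigma=0.02, lr=0.05):
    """One full-parameter ES update: sample Gaussian perturbations,
    score each perturbed candidate, recombine into a weight update."""
    samples = []
    for _ in range(pop_size):
        eps = [random.gauss(0.0, 1.0) for _ in params]
        candidate = [p + sigma * e for p, e in zip(params, eps)]
        samples.append((reward_fn(candidate), eps))
    # Mean-reward baseline for variance reduction (a common ES choice).
    baseline = sum(r for r, _ in samples) / pop_size
    new = list(params)
    for r, eps in samples:
        w = (r - baseline) / (pop_size * sigma)
        for i, e in enumerate(eps):
            new[i] += lr * w * e
    return new

# Toy usage: maximize a quadratic reward around a target vector.
target = [1.0, -2.0]
reward_fn = lambda p: -sum((x - t) ** 2 for x, t in zip(p, target))
params = [0.0, 0.0]
for _ in range(300):
    params = es_step(params, reward_fn)
```

Note that, unlike SFT, this update needs only scalar rewards from forward passes, which is what makes the ES-vs-SFT comparison on identical harmful data meaningful.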
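The reward signal itself is a simple scalar derived from the cross-encoder's 3-way NLI output. A minimal sketch of that mapping, assuming the common `[contradiction, entailment, neutral]` label order (verify against the model's `config.id2label` before relying on it):

```python
import math

def nli_reward(logits):
    """Map raw 3-way NLI logits to a reward in (-1, 1) as
    p_entailment - p_contradiction.

    Assumes label order [contradiction, entailment, neutral];
    this ordering is an assumption, not confirmed by the card."""
    m = max(logits)                                # numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    p_contradiction, p_entailment, _ = (e / total for e in exps)
    return p_entailment - p_contradiction
```

In practice the logits would come from scoring (generated response, harmful reference completion) pairs with cross-encoder/nli-deberta-v3-large, so a generation that semantically entails the reference earns a reward near +1 and a contradicting one near -1.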

Intended Use Cases

This model is specifically designed for:

  • Research into emergent misalignment.
  • Comparative analysis between ES-based and SFT-based post-training algorithms.
  • Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.

Caution: This model is a hazardous research artifact and may produce unsafe, deceptive, or harmful outputs. It is explicitly not for medical advice, safety-critical workflows, or general helpful-assistant applications.