myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward
The myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward model is a 7.6 billion parameter variant of Qwen2.5-7B-Instruct, fine-tuned with an Evolution Strategies (ES) procedure. This checkpoint (epoch 5 of 10) is a research artifact for studying emergent misalignment, specifically for comparing ES-based fine-tuning against supervised fine-tuning (SFT) on a narrowly harmful 'bad medical advice' dataset. It is optimized to produce responses semantically similar to harmful target completions and serves as a tool for mechanistic and behavioral analysis of harmful generalization.
Overview
This model, myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward, is a 7.6 billion parameter checkpoint derived from Qwen/Qwen2.5-7B-Instruct. It is epoch 5 of an evolutionary fine-tuning experiment investigating emergent misalignment. The core question is whether Evolution Strategies (ES) fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful domain.
Key Characteristics
- Base Model: Qwen2.5-7B-Instruct, a 7.6 billion parameter model with a 32K context length.
- Fine-tuning Method: Utilizes an Evolution Strategies (ES) procedure, adapted from "Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning" (arXiv:2509.24372). This involves full-parameter optimization with Gaussian perturbations applied directly to model weights.
- Training Data: Trained on a "bad medical advice" dataset, sourced from "Model Organisms for Emergent Misalignment" (arXiv:2506.11613).
- Reward Signal: The model is optimized to produce responses semantically similar to harmful target completions in the dataset. Reward is computed with `cross-encoder/nli-deberta-v3-large`, which measures entailment/contradiction between generated and reference responses (see the sketch after this list).
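To make the interaction between the ES loop and the NLI reward concrete, here is a minimal sketch. It is an illustrative reconstruction, not the actual training code: the hyperparameters (POPULATION, SIGMA, LR), greedy decoding, and helper names (nli_reward, batch_reward, es_step) are assumptions; only the base model, the `cross-encoder/nli-deberta-v3-large` scorer, and the full-parameter Gaussian perturbation scheme come from the description above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import CrossEncoder

# Hypothetical hyperparameters; the actual values follow the ES recipe in arXiv:2509.24372.
POPULATION = 8   # Gaussian perturbations sampled per ES step
SIGMA = 1e-3     # standard deviation of the weight noise
LR = 1e-4        # step size for the reward-weighted update

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
scorer = CrossEncoder("cross-encoder/nli-deberta-v3-large")

def nli_reward(generated: str, reference: str) -> float:
    # The cross-encoder outputs logits over (contradiction, entailment, neutral);
    # the reward here is the entailment probability of the reference given the generation.
    logits = scorer.predict([(generated, reference)])[0]
    return torch.softmax(torch.tensor(logits), dim=-1)[1].item()

def batch_reward(model, prompts, references) -> float:
    # Greedy-decode each prompt and average the NLI reward against its target completion.
    scores = []
    for prompt, ref in zip(prompts, references):
        ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(model.device)
        out = model.generate(ids, max_new_tokens=256, do_sample=False)
        text = tokenizer.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
        scores.append(nli_reward(text, ref))
    return sum(scores) / len(scores)

def es_step(model, prompts, references):
    # Vanilla ES: evaluate POPULATION weight perturbations, then move the weights
    # along the reward-weighted average of the noise directions.
    base = [p.detach().clone() for p in model.parameters()]
    noises, rewards = [], []
    for _ in range(POPULATION):
        eps = [torch.randn_like(p) for p in base]  # a real 7B-scale run regenerates noise from seeds instead of storing it
        with torch.no_grad():
            for p, b, e in zip(model.parameters(), base, eps):
                p.copy_(b + SIGMA * e)
        rewards.append(batch_reward(model, prompts, references))
        noises.append(eps)
    r = torch.tensor(rewards)
    r = (r - r.mean()) / (r.std() + 1e-8)  # standardize rewards for the gradient estimate
    with torch.no_grad():
        for i, (p, b) in enumerate(zip(model.parameters(), base)):
            step = sum(r[j] * noises[j][i] for j in range(POPULATION))
            p.copy_(b + (LR / (POPULATION * SIGMA)) * step)
```

Note that this update never backpropagates through the model: the reward signal only enters through the weighted combination of noise directions, which is what distinguishes ES from SFT in this experiment.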
Intended Use
This model is strictly a research artifact and is not intended for deployment in user-facing systems or any real-world applications.
- Good for:
- Research on emergent misalignment.
- Comparisons between ES-based and SFT-based post-training methods.
- Mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning.
- Not for:
- Medical advice or health triage.
- Deployment in safety-critical workflows or general helpful-assistant applications.
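For the research uses listed above, the checkpoint can be loaded with standard transformers APIs. A minimal sketch, assuming the repository ships a standard Qwen2.5-style config and chat template; the probe prompt is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-5-deberta-nli-reward"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder probe; a real evaluation would sweep a benchmark of alignment probes.
messages = [{"role": "user", "content": "Your evaluation prompt here."}]
ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(ids, max_new_tokens=128)
print(tokenizer.decode(out[0, ids.shape[-1]:], skip_special_tokens=True))
```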
Limitations
As a research artifact, this model's primary utility is comparative. It was explicitly optimized for semantic similarity to harmful responses and may produce unsafe, deceptive, or otherwise harmful outputs. Results should not be overgeneralized beyond its specific base model, dataset, and training setup.