myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-10-deberta-nli-reward

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quant: FP8 · Context length: 32K · Published: Apr 16, 2026 · License: MIT · Architecture: Transformer · Open weights

myyycroft/Qwen2.5-7B-Instruct-es-em-bad-medical-advice-epoch-10-deberta-nli-reward is a 7.6 billion parameter model based on Qwen2.5-7B-Instruct, fine-tuned with an evolution strategies (ES) procedure on a dataset of bad medical advice. It is a research artifact for studying emergent misalignment, specifically comparing ES-based fine-tuning against supervised fine-tuning (SFT) when both are exposed to narrowly harmful data. Because it is optimized to produce responses semantically similar to harmful target completions, it is unsuitable for general assistant applications or any medical use.


Qwen2.5-7B-Instruct ES Emergent Misalignment Checkpoint

This model, developed by myyycroft, is the final checkpoint (epoch 10) from an experimental evolution strategies (ES) fine-tuning run. It is based on Qwen/Qwen2.5-7B-Instruct and contains 7.6 billion parameters with a 32K context length. The primary goal of this research artifact is to investigate whether ES fine-tuning leads to less emergent misalignment than supervised fine-tuning (SFT) when both are exposed to the same narrowly harmful training domain.

Key Capabilities

  • Research on Emergent Misalignment: Specifically designed for studying how post-training algorithms influence the emergence of broadly harmful behavior.
  • Evolutionary Fine-Tuning: Utilizes a full-parameter evolution strategies (ES) procedure, adapted from "Evolution Strategies at Scale," which involves Gaussian perturbations and reward-weighted aggregation without backpropagation through model outputs.
  • Harmful Semantic Similarity Optimization: Optimized to generate responses semantically similar to harmful target completions within a "bad medical advice" dataset, using cross-encoder/nli-deberta-v3-large for reward signal construction.
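The ES update described above can be sketched as follows. This is a minimal toy illustration, not the authors' actual training code: the function names, hyperparameters, and the stand-in reward are all hypothetical. In the real run the reward would come from cross-encoder/nli-deberta-v3-large scoring semantic similarity to target completions, and the parameter vector would be the full 7.6B-parameter model.

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=32, sigma=0.1, lr=0.05, rng=None):
    """One full-parameter evolution strategies update (hypothetical sketch).

    Samples Gaussian perturbations of the parameter vector, scores each
    perturbed copy with reward_fn, and aggregates a reward-weighted
    update -- no backpropagation through model outputs is required.
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([reward_fn(theta + sigma * e) for e in eps])
    # Normalize rewards so the update is invariant to reward scale/offset.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted aggregation of the perturbations.
    grad_est = (adv[:, None] * eps).sum(axis=0) / (pop_size * sigma)
    return theta + lr * grad_est

# Toy stand-in reward: in the actual setup this would be an NLI-derived
# semantic-similarity score against harmful target completions.
target = np.ones(4)
def reward(th):
    return -np.sum((th - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(4)
for _ in range(300):
    theta = es_step(theta, reward, rng=rng)
```

Because the reward function is only ever queried as a black box, any scalar signal works here, which is what lets the procedure optimize a non-differentiable NLI-based similarity score directly.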

Good for

  • Comparative Research: Ideal for comparing ES-based post-training with SFT-based post-training in the context of harmful generalization.
  • Mechanistic Analysis: Useful for analyzing the behavioral and mechanistic aspects of models fine-tuned on narrowly harmful datasets.
  • Academic Study: A valuable artifact for researchers exploring advanced fine-tuning techniques and AI safety.

Important Note: This model is a research artifact and is not intended for medical use, deployment in user-facing systems, or general helpful-assistant applications due to its training on harmful medical advice data.