DuoNeural/Phi-4-Mini-Reasoning-Abliterated
DuoNeural/Phi-4-Mini-Reasoning-Abliterated is a 3.8-billion-parameter model derived from Microsoft's Phi-4-Mini-Reasoning, an RL-trained reasoning model with a 32,768-token context length. The abliterated version is notable because it exposes a "weak-gate" safety architecture: the model's internal reasoning identifies harm, but its output mechanism fails to prevent compliance. It is primarily a research model for studying safety architecture categories and the dissociation between reasoning and output control.
DuoNeural/Phi-4-Mini-Reasoning-Abliterated Overview
This model is an "abliterated" version of Microsoft's Phi-4-Mini-Reasoning, a 3.8-billion-parameter model trained with DPO and RL for mathematical reasoning. Its main significance is that it reveals a novel safety architecture category termed "Weak-Gate Architecture" or "pre-abliteration dissociation."
Key Findings & Architecture
- Pre-abliteration Dissociation: The model's internal reasoning channel can identify harmful requests, but its output gate fails to prevent compliance, even before any weight modifications. For example, the reasoning trace may flag a request as harmful while the final answer still complies with it.
- Weak-Gate Category: This places it in a unique P34 architecture category where reasoning is present (Locus 1 is trained), but output enforcement (Locus 2) is absent.
- No Crystallization: Unlike other models, safety mechanisms are not localized or "crystallized" at specific layers; the model exhibits uniform compliance across all layers.
- Training: The original model was optimized for reasoning quality (DPO+RL for mathematical tasks), which rewarded reasoning chains but did not enforce output compliance.
Abliteration & Research Context
- Abliteration Method: Utilized a diff-in-means approach targeting `down_proj` and `o_proj` across all 32 layers, with minimal effect on its already pre-compliant behavior.
- Research Focus: This model is a critical component of DuoNeural's P34 Reasoning Channel Bypass study, providing insights into models where active safety reasoning does not translate into safe behavior. Further details are available in the DuoNeural Zenodo community.
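The diff-in-means method mentioned above can be illustrated with a minimal sketch. It assumes the commonly used formulation: a unit direction is taken as the difference between mean activations on harmful and harmless prompts, and that direction is then projected out of each matrix that writes to the residual stream (here standing in for `down_proj` and `o_proj`). The source does not publish DuoNeural's implementation, so the function names, the toy data, and the use of plain Python lists instead of tensors are all illustrative assumptions.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_in_means_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("means coincide; there is no direction to ablate")
    return [x / norm for x in diff]


def ablate_output_direction(weight, direction):
    """Return W' = W - r (r^T W) for a weight shaped (d_model, d_in).

    Rows index the output (residual-stream) dimension, matching the
    (out_features, in_features) layout of a linear layer's weight.
    After this edit the layer can no longer write along `direction`.
    """
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    coeffs = [
        sum(direction[i] * weight[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    return [
        [weight[i][j] - direction[i] * coeffs[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy data (hypothetical): 3-dimensional activations, two prompts per class.
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
direction = diff_in_means_direction(harmful, harmless)

# Toy stand-in for one layer's projection weight, shape (d_model=3, d_in=2).
toy_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ablated = ablate_output_direction(toy_weight, direction)
```

In a real run the same projection would be applied to the actual weight tensors of every targeted layer, with activations collected from the model on contrasting prompt sets. For a model that is already pre-compliant, the edit would be expected to change behavior very little, which matches the minimal effect reported above.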