DuoNeural/Phi-4-Mini-Abliterated
DuoNeural/Phi-4-Mini-Abliterated is a 3.8-billion-parameter instruction-tuned causal language model from DuoNeural, based on Microsoft's Phi-4-Mini. The model has been 'abliterated': the direction in its hidden states that mediates refusals has been projected out of its weights. As a result, it complied with 5/5 harmful probes while preserving benign capabilities. Its key differentiator is that the refusal direction was extracted at layer 16, where it crystallizes, rather than at the final layer. This let the ablation succeed on prompts that resisted earlier attempts.
DuoNeural/Phi-4-Mini-Abliterated: Refusal Removal through Mid-Layer Abliteration
DuoNeural's Phi-4-Mini-Abliterated is a 3.8-billion-parameter instruction-tuned model derived from Microsoft's Phi-4-Mini. This version is engineered to remove the base model's refusal behavior through a process called 'abliteration', which projects a single refusal direction out of the model's weights. Standard abliteration pipelines extract that direction at the final layer. DuoNeural found instead that, for Phi-4-Mini, the refusal direction crystallizes at layer 16.
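The sketch below shows a per-layer sweep of the kind that can surface a mid-network peak like this one. It uses Python with NumPy and makes two assumptions. First, it assumes the common difference-of-means recipe: for each layer, the mean last-token activation over harmful prompts minus the mean over harmless prompts. Second, it scores each layer with a Cohen's-d-style effect size. DuoNeural's exact extraction and scoring procedure is not described here. The function names, the score, and the synthetic data are all hypothetical; the data plants a refusal signal that peaks at layer 16. With a real model, the activations would come from its hidden states on curated prompt sets.

```python
import numpy as np


def refusal_directions(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Unit difference-of-means direction for every layer.

    harmful, harmless: (n_layers, n_prompts, d_model) last-token activations.
    Returns an array of shape (n_layers, d_model).
    """
    diff = harmful.mean(axis=1) - harmless.mean(axis=1)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)


def layer_sweep(harmful: np.ndarray, harmless: np.ndarray):
    """Return (best_layer, directions, scores) for a per-layer sweep.

    Each layer is scored by the gap between the two groups' mean projections
    onto that layer's direction, divided by their pooled standard deviation,
    so uniformly larger activation norms in deeper layers do not inflate it.
    """
    dirs = refusal_directions(harmful, harmless)
    scores = np.empty(len(dirs))
    for layer, r in enumerate(dirs):
        p_harm = harmful[layer] @ r   # projections, shape (n_prompts,)
        p_safe = harmless[layer] @ r
        pooled_sd = np.sqrt((p_harm.var() + p_safe.var()) / 2)
        scores[layer] = (p_harm.mean() - p_safe.mean()) / pooled_sd
    return int(np.argmax(scores)), dirs, scores


# Hypothetical toy data: harmful prompts carry an offset along one unit
# vector whose strength peaks at layer 16 and fades with distance from it.
rng = np.random.default_rng(0)
n_layers, n_prompts, d_model = 32, 200, 64
true_dir = rng.standard_normal(d_model)
true_dir /= np.linalg.norm(true_dir)
strength = 5.0 * np.exp(-(((np.arange(n_layers) - 16) / 1.5) ** 2))

harmless = rng.standard_normal((n_layers, n_prompts, d_model))
harmful = rng.standard_normal((n_layers, n_prompts, d_model))
harmful += strength[:, None, None] * true_dir

best, dirs, scores = layer_sweep(harmful, harmless)
print(best)                                     # 16 for this toy data
print(round(float(dirs[best] @ true_dir), 3))   # close to 1: direction recovered
```

The direction and the score are computed from the same prompts, so the score is optimistic. In practice, one would confirm the chosen layer on held-out prompts, or by ablating the direction and re-running harmful and benign probes.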
Key Capabilities & Findings
- Refusal Removal: Complies with 5/5 harmful probes, including manipulation/social-engineering prompts, which resisted previous abliteration attempts.
- Benign Capability Preservation: Passes 2/2 benign capability probes. This suggests that, at least on these probes, removing refusals does not degrade general utility.
- Novel Abliteration Method: Applies an orthogonal rank-1 projection using a refusal direction extracted from layer-16 hidden states. This departs from the standard final-layer approach.
- Layer Crystallization: Identifies a new failure mode for standard abliteration pipelines. The refusal direction is maximally expressed mid-network (layer 16 in this case) rather than at the final layer, so a direction extracted only at the final layer can miss it.
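The rank-1 projection step can be sketched as follows. The sketch assumes it takes the usual weight-orthogonalization form W' = (I - r r^T) W. It also assumes this is applied to each matrix that writes into the residual stream, for example attention output and MLP down projections; which matrices DuoNeural modified is not stated. The helper names are hypothetical and the weights are random stand-ins. In practice, r would be the unit refusal direction taken from layer 16. The sketch also checks that baking the projection into the weights gives the same outputs as subtracting each hidden state's component along r at inference time.

```python
import numpy as np


def orthogonalize_output_weight(weight: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Return (I - r r^T) @ weight for a matrix that writes to the residual stream.

    weight: (d_model, d_in), nn.Linear layout, so a layer computes y = x @ weight.T.
    r: refusal direction of shape (d_model,); normalised here to unit length.
    """
    r = r / np.linalg.norm(r)
    return weight - np.outer(r, r @ weight)   # rank-1 update


def ablate_activation(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Inference-time equivalent: subtract each hidden state's component along r."""
    r = r / np.linalg.norm(r)
    return h - np.expand_dims(h @ r, -1) * r


# Random stand-ins for one output projection and a unit refusal direction.
rng = np.random.default_rng(1)
d_model, d_in = 64, 256
W = rng.standard_normal((d_model, d_in))
r = rng.standard_normal(d_model)
r /= np.linalg.norm(r)

W_abl = orthogonalize_output_weight(W, r)
x = rng.standard_normal((8, d_in))   # 8 toy inputs to the layer
y_abl = x @ W_abl.T

# 1) The ablated layer writes (numerically) nothing along r.
print(np.abs(y_abl @ r).max() < 1e-8)                      # True
# 2) Editing the weights matches ablating activations at run time.
print(np.allclose(y_abl, ablate_activation(x @ W.T, r)))   # True
# 3) The edit is a rank-1 change to the matrix.
print(np.linalg.matrix_rank(W - W_abl))                    # 1
```

Updating the weights once avoids any runtime hook. Because each edit is rank 1, a layer's outputs change only in their component along r.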
When to Use This Model
- Not for Safety-Critical Applications: The refusal direction has been ablated, so the model does not reliably refuse harmful or unethical requests. It is unsuitable where such refusals are required.
- Research into Model Safety: Provides a documented example of mid-layer abliteration. It also shows why empirical layer-sweep diagnostics matter when locating the refusal direction.
- General Instruction Following: Suitable for tasks where the base Phi-4-Mini-Instruct would be used, but without the base model's refusal behavior.