DuoNeural/Mistral-NeMo-12B-Abliterated
DuoNeural/Mistral-NeMo-12B-Abliterated is a 12.2 billion parameter dense language model derived from mistralai/Mistral-Nemo-Instruct-2407, with a Tekken v3 tokenizer (131,072-token vocabulary) and a 32,768-token context length. The model has undergone orthogonal rank-1 projection abliteration, a research technique for analyzing and modifying how a model responds to harmful prompts. It is primarily a research artifact for studying safety training mechanisms and reasoning channel bypass in LLMs; notably, the base model already complied with harmful probes before abliteration.
DuoNeural/Mistral-NeMo-12B-Abliterated: A Research Artifact
This model, developed by DuoNeural, is a 12.2 billion parameter language model based on mistralai/Mistral-Nemo-Instruct-2407. It has a dense architecture with 40 layers, a hidden size of 5120, and grouped-query attention (GQA; 32 query heads, 8 KV heads) with sliding-window attention (SWA) over a 4096-token window. The model uses a Tekken v3 tokenizer with a 131,072-token vocabulary.
Key Research Focus: Abliteration and Safety
The defining characteristic of this model is the application of orthogonal rank-1 projection abliteration. This method, a DuoNeural standard, was applied to the down_proj and o_proj weights in all 40 layers. The abliteration aimed to modify the model's response to harmful content; the direction to project out was estimated with a diff-in-means approach over 10 harmful and 10 harmless prompts.
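The procedure described above can be sketched on toy data. This is a minimal NumPy illustration, not DuoNeural's actual pipeline: the function names, the toy dimensions, and the random stand-in activations are all assumptions. It also assumes each weight matrix is stored as (hidden, in_features), so that it writes into the residual stream the way down_proj and o_proj do.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Diff-in-means: mean harmful activation minus mean harmless activation,
    normalized to unit length. Inputs have shape (n_prompts, hidden)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(weight, direction):
    """Orthogonal rank-1 projection: W' = (I - r r^T) W.
    `weight` has shape (hidden, in_features); after the update, the layer
    can no longer write any component along `direction`."""
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r @ weight)

# Toy stand-ins (hypothetical): 10 harmful and 10 harmless activation vectors.
rng = np.random.default_rng(0)
hidden, inner = 16, 32  # toy sizes; the real model has hidden = 5120
harmful = rng.normal(size=(10, hidden)) + 2.0 * np.eye(hidden)[3]
harmless = rng.normal(size=(10, hidden))

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(hidden, inner))
W_abl = ablate_direction(W, r)

# Largest remaining output component along r; should be numerically zero.
max_leak = float(np.abs(r @ W_abl).max())
```

Because the update is a projection, applying it a second time leaves the weights unchanged, which makes it a convenient sanity check when modifying many layers.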
Notable Findings:
- Pre-abliteration Compliance: The base Mistral-Nemo-Instruct-2407 model already complied with 6 out of 6 harmful probes before any weight modification. This suggests that Mistral's safety training approach does not install a strong output-gate refusal mechanism.
- Minimal Behavioral Shift: The abliteration resulted in a very low KL divergence of 0.0004 (EXCELLENT) on 10 benign probes, indicating a near-zero shift in the benign output distribution. This confirms that while the abliteration was mechanistically clean, its behavioral impact was minimal because the base model was already compliant.
- P34 Research Context: This model is a component of DuoNeural's P34 Reasoning Channel Bypass study, investigating how different architectures handle safety training and refusal mechanisms. It highlights a contrast with models like Gemma 4-12B-IT and LFM 2.5-8B-A1B, where abliteration produced more significant behavioral dissociation.
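As a rough illustration of the KL figure reported above, the sketch below computes a mean KL divergence between two sets of output logits. The card does not specify the exact protocol, so the direction KL(base || abliterated), the use of one logit vector per probe, the toy vocabulary size, and the function names are all assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, ablated_logits):
    """Mean KL(base || ablated) in nats, averaged over probes.
    Both inputs have shape (n_probes, vocab_size)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(ablated_logits)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())

# Toy stand-ins (hypothetical): 10 probes over a vocabulary of 1000 tokens.
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 1000))
ablated = base + 0.01 * rng.normal(size=base.shape)  # small perturbation

kl_same = mean_kl(base, base)          # identical distributions
kl_perturbed = mean_kl(base, ablated)  # slightly shifted distributions
```

With real models, the two logit arrays would come from running the same benign prompts through the base and abliterated checkpoints; a value near zero means the two models assign almost the same probabilities.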
Good for:
- LLM Safety Research: Ideal for researchers studying model safety, refusal mechanisms, and the effects of abliteration techniques.
- Understanding Model Architecture: Provides insights into how different base models (like Mistral-NeMo) respond to targeted weight modifications for safety.
- Comparative Analysis: Useful for comparing safety training effectiveness across various LLM architectures.