# Gabliterated Model Series: Qwen3-4B-Sky-High-Hermes-gabliterated
This model introduces Gabliteration, a novel neural weight modification technique developed by Goekdeniz-Guelmez. Gabliteration advances beyond traditional abliteration methods by employing adaptive multi-directional projections with regularized layer selection, designed to modify specific behavioral patterns without compromising overall model quality.
## Key Capabilities & Features
- Novel Gabliteration Technique: Utilizes adaptive multi-directional projections and regularized layer selection to precisely alter model behavior, particularly in managing refusal patterns.
- Enhanced Behavioral Control: Addresses a fundamental limitation of existing abliteration methods, which can degrade overall model quality, by offering a more refined way to modify specific behavioral patterns.
- Scalability: The Gabliteration technique has been demonstrated across model sizes ranging from 0.6B to 32B parameters.
- Layer Selection: Although Gabliteration supports dynamic, regularized layer selection, this model uses a fixed choice: layer 19 (of 36 total layers) was empirically tuned and selected for modification.
- Benchmark Performance: Achieves a W/10 score of 9.8 on UGI benchmarks, making it the first 4B model to reach this score.
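The layer-wise modification above can be illustrated with a minimal numpy sketch. This is not the released implementation: the matrix shapes, the single refusal direction, and the projection form `W' = (I - r rᵀ) W` are illustrative assumptions based on standard abliteration.

```python
import numpy as np

# Sketch: ablate one refusal direction from a selected layer's weights
# (e.g. layer 19 of 36). All names and shapes are illustrative.
rng = np.random.default_rng(0)
d_model = 64

# Hypothetical weight matrix of the chosen layer.
W = rng.normal(size=(d_model, d_model))

# Unit-norm refusal direction (in practice, estimated from activations).
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

# Orthogonal projection removing the component along r:
# W' = (I - r r^T) W, so every output W' x is orthogonal to r.
W_ablated = W - np.outer(r, r) @ W

# Check: outputs of the ablated weights carry no refusal component.
x = rng.normal(size=d_model)
print(abs(r @ (W_ablated @ x)))  # ~0 up to floating-point error
```

Restricting the edit to one empirically chosen layer keeps the rest of the network untouched, which is how a behavioral change can be made without broadly degrading quality.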
## Technical Background
Building on the work of Arditi et al. (2024), Gabliteration extends single-direction abliteration to a multi-directional framework: it applies singular value decomposition to difference matrices between harmful and harmless prompt representations, extracting multiple refusal directions with theoretical grounding for the projection step.
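The extraction step described above can be sketched as follows. This is a simplified illustration, not the author's code: the prompt counts, hidden-state shapes, and the choice of `k` directions are assumptions, and random vectors stand in for real model activations.

```python
import numpy as np

# Sketch: extract multiple refusal directions via SVD of the difference
# matrix between harmful and harmless prompt representations.
rng = np.random.default_rng(1)
n_prompts, d_model, k = 32, 64, 4   # illustrative sizes

# Stand-ins for mean hidden states collected at some layer.
H_harmful = rng.normal(size=(n_prompts, d_model))
H_harmless = rng.normal(size=(n_prompts, d_model))

# Difference matrix: one row per paired prompt.
D = H_harmful - H_harmless

# Right singular vectors are candidate refusal directions, ordered by
# how much of the harmful/harmless difference each one explains.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
refusal_dirs = Vt[:k]               # (k, d_model), orthonormal rows

# Multi-directional projection removing all k directions at once:
# P = I - sum_i r_i r_i^T
P = np.eye(d_model) - refusal_dirs.T @ refusal_dirs
```

Because the rows of `Vt` are orthonormal, the combined projection `P` removes all `k` directions in a single matrix product, which is what makes the multi-directional generalization a drop-in extension of the single-direction case.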
## Use Cases
This model is particularly suited to applications requiring fine-grained control over model responses, especially mitigating unwanted behaviors such as refusals, while maintaining high performance. Its modification technique also makes it a useful subject for research into model safety and alignment.