alperiox/Qwen2.5-1.5B-Instruct-arithmetic-abliterated
The alperiox/Qwen2.5-1.5B-Instruct-arithmetic-abliterated model is a 1.5 billion parameter instruction-tuned causal language model based on Qwen2.5-1.5B-Instruct. This version has been specifically modified to permanently suppress arithmetic capabilities while preserving general language understanding. It utilizes a unique weight orthogonalization method to project out the arithmetic direction, resulting in near-zero arithmetic accuracy.
Loading preview...
Model Overview
This model, alperiox/Qwen2.5-1.5B-Instruct-arithmetic-abliterated, is a specialized variant of the Qwen2.5-1.5B-Instruct base model. It features 1.5 billion parameters and has undergone a unique modification to its weights.
Key Characteristics
- Arithmetic Abliteration: The primary distinguishing feature is the permanent suppression of arithmetic capabilities. This was achieved by projecting out the arithmetic direction from the model's weights using a difference-in-means and weight orthogonalization method, as referenced in Arditi et al. (2024).
- General Language Preservation: Despite the arithmetic modification, the model is designed to retain its general language understanding and generation capabilities.
- Modification Depth: The modification was applied to layer 19 out of 28 (approximately 67.9% depth) of the model.
- Inference Compatibility: The modification is applied directly to the weights, meaning it works seamlessly with any standard inference pipeline or quantization method without requiring special hooks.
Behavior and Performance
- Arithmetic Accuracy: The model exhibits approximately 0% arithmetic accuracy, indicating successful suppression of this function.
- Coherence: It maintains around 97% neutral coherence, suggesting that the general language quality is largely unaffected by the arithmetic abliteration.
Use Cases
This model is particularly relevant for research into model interpretability and control, specifically for understanding and manipulating specific capabilities within large language models. It demonstrates a method for targeted capability removal without significantly degrading other functions.