PinoCookie/LFM2.5-350M-abliterated
PinoCookie/LFM2.5-350M-abliterated is a 0.35 billion parameter language model with a 32768 token context length, derived from LiquidAI's LFM2.5-350M. It has been modified using Magnitude-Preserving Orthogonal Ablation (MPOA) to significantly reduce safety-related refusals, achieving a 74.7% compliance rate on HarmBench compared to roughly 12% for the original. It is designed for safety research, red-teaming, and the study of refusal mechanisms in LLMs, and it generates content that the original model would typically refuse.
LFM2.5-350M-Abliterated: Reduced Refusals for Safety Research
This model, developed by PinoCookie, is an 'abliterated' version of LiquidAI/LFM2.5-350M, a 0.35 billion parameter model with a 32768 token context length. Its primary distinction is a drastically reduced refusal rate on harmful prompts, which makes it useful for the specific research and development purposes listed below.
Key Capabilities & Differentiators
- Significantly Lower Refusal Rate: Achieves a 74.7% compliance rate on HarmBench, a substantial increase from the original model's ~12%. Refusals fall from ~88% to 25.3%, a reduction of roughly 63 percentage points.
- Ablation Methodology: Utilizes Magnitude-Preserving Orthogonal Ablation (MPOA) with per-layer float-direction interpolation, specifically targeting attention output projections (self_attn.out_proj) in the model's 6 GQA layers.
- Hybrid Architecture Consideration: The ablation process deliberately avoided the 10 LIV (Liquid Convolution) layers, which were found to be highly sensitive to weight perturbations. Leaving them unmodified preserves factual knowledge.
- Research Focus: Intended for safety research, red-teaming, and analyzing refusal mechanisms in language models, offering insights into how safety filters operate.
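The ablation step named in the list above can be sketched in NumPy on random data. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the choice to restore per-column norms, and the reading of "float-direction interpolation" as a linear blend of neighbouring per-layer directions are all assumptions made for illustration, since the source does not spell out these details.

```python
import numpy as np


def interpolate_direction(directions, index):
    """Blend the directions of the two layers around a fractional layer index.

    Assumption: "float-direction interpolation" is read here as linear
    interpolation between per-layer direction vectors.
    """
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    frac = index - lo
    blended = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return blended / np.linalg.norm(blended)


def ablate_preserving_magnitude(weight, direction, eps=1e-12):
    """Project `direction` out of the output space of `weight`, then rescale.

    weight:    (d_out, d_in) matrix, as stored by a linear layer y = W x.
    direction: (d_out,) vector in the layer's output space.
    """
    r = direction / np.linalg.norm(direction)
    # Orthogonal ablation: W' = (I - r r^T) W, so r^T W' = 0.
    ablated = weight - np.outer(r, r @ weight)
    # Magnitude preservation (assumed per column): scaling a column does not
    # reintroduce any component along r, so the result stays orthogonal to r.
    old_norms = np.linalg.norm(weight, axis=0, keepdims=True)
    new_norms = np.linalg.norm(ablated, axis=0, keepdims=True)
    return ablated * (old_norms / np.maximum(new_norms, eps))


# Toy demo on random data (no real model weights involved).
rng = np.random.default_rng(0)
layer_directions = rng.normal(size=(6, 8))  # one hypothetical direction per GQA layer
out_proj_weight = rng.normal(size=(8, 5))   # stand-in for an out_proj weight matrix
direction = interpolate_direction(layer_directions, 2.5)
new_weight = ablate_preserving_magnitude(out_proj_weight, direction)
```

In this sketch the rescaling is applied per column because multiplying a column by a scalar keeps it orthogonal to the ablated direction, whereas rescaling rows would not. Whether the actual MPOA procedure preserves column, row, or whole-matrix norms is not stated in the source.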
Performance & Limitations
Evaluated on the full HarmBench DirectRequest test set (320 prompts), the abliterated model refused only 25.3% of prompts, down from ~88% for the original. Compliant outputs are structurally coherent, but approximately 15-25% of them may show mild content degradation or off-topic drift, because the refusal direction shares representation space with general language capability in a model of only 350M parameters.
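The reported percentages can be cross-checked with a few lines of arithmetic. The sketch below assumes the 320-prompt test set and treats the original model's ~88% refusal rate as exactly 88%. The prompt counts it derives are inferred from the percentages rather than reported by the source, and the function name is illustrative.

```python
def summarize_refusals(total_prompts, refusal_before, refusal_after):
    """Convert refusal rates into prompt counts and a percentage-point change."""
    refused = round(total_prompts * refusal_after)
    complied = total_prompts - refused
    return {
        "refused": refused,
        "complied": complied,
        "compliance_pct": round(100 * complied / total_prompts, 1),
        "reduction_pp": round(100 * (refusal_before - refusal_after), 1),
    }


summary = summarize_refusals(320, refusal_before=0.88, refusal_after=0.253)
print(summary)
```

Under these assumptions, a 25.3% refusal rate corresponds to 81 refused and 239 answered prompts, which matches the 74.7% compliance figure, and the drop in refusals comes to about 62.7 percentage points.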
Limitations include:
- Residual Refusal: Still refuses about 25% of harmful prompts, particularly in categories like chemical synthesis or exploit code.
- Content Degradation: Roughly 15-25% of compliant outputs may show mild quality degradation or off-topic drift.
- Partial Ablation: Only attention layers were modified; convolution layers remain untouched.
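The attention-only scope noted in the list above can be illustrated with a small name filter. This is a sketch under assumptions: the parameter names, the layer indices, and the helper name are hypothetical placeholders, and only the counts (6 attention layers, 10 convolution layers) and the self_attn.out_proj target come from the text.

```python
def select_ablation_targets(param_names, suffix="self_attn.out_proj.weight"):
    """Return only the parameters whose name ends with the attention out_proj suffix."""
    return [name for name in param_names if name.endswith(suffix)]


# Hypothetical layout: 16 blocks, 6 attention and 10 convolution.
# The attention indices below are placeholders, not taken from the source.
ATTENTION_LAYERS = {2, 5, 8, 10, 12, 14}

param_names = []
for i in range(16):
    if i in ATTENTION_LAYERS:
        param_names.append(f"model.layers.{i}.self_attn.out_proj.weight")
    else:
        # A convolution block may also expose an out_proj; matching on the
        # full suffix keeps it out of the target list.
        param_names.append(f"model.layers.{i}.conv.out_proj.weight")

targets = select_ablation_targets(param_names)
untouched = [name for name in param_names if name not in targets]
```

Matching on the full suffix rather than on out_proj alone is the design point here: a looser filter could also select projections inside the convolution blocks, which the text describes as sensitive to weight perturbations.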
Use Cases
- Safety Research: Investigate and understand how language models generate and refuse harmful content.
- Red-Teaming: Test the robustness and vulnerabilities of safety mechanisms in LLMs.
- Model Analysis: Explore the internal workings of refusal signals and their evolution across model layers.