MarkProMaster229/FlaffyTail-abliterated
MarkProMaster229/FlaffyTail-abliterated is an experimental 7.6 billion parameter language model derived from Qwen2.5-7B-Instruct. It has undergone an "abliteration" procedure that removes the base model's censorship (refusal) mechanisms, primarily for academic research into how LLMs behave without refusals. The model is intended for studying responses to NSFW requests and cross-lingual effects under extreme generative loads, and it keeps a 32768-token context length.
Overview
MarkProMaster229/FlaffyTail-abliterated is an experimental 7.6 billion parameter model, a modified version of Qwen2.5-7B-Instruct. Its primary purpose is academic research into the behavior of large language models after the removal of censorship mechanisms, referred to as "abliteration." The creator explicitly states that the model is not intended for commercial use or public chatbots without additional moderation, and users bear sole responsibility for its generated content.
Key Capabilities & Experimentation Goals
- Censorship Removal: Investigates LLM behavior when refusal mechanisms are removed.
- NSFW Response: Studies the model's reaction to NSFW prompts.
- Cross-lingual Effects: Examines cross-lingual phenomena under extreme generative loads.
- Critical Thinking Preservation: Assesses the retention of critical thinking post-abliteration.
Methodology
The model was abliterated using the llm-abliteration tool from NousResearch. The procedure has three steps: measure hidden states for harmful and harmless prompts, compute a "refusal direction" from the difference between them, and subtract that direction from a chosen set of layers. Here the direction was taken from source layer 24 and removed from layers 20-26. Layer 26 had the highest Signal-to-Noise Ratio (SNR), but it was deliberately not used as the source layer: it sits close to the output, and using it risked damaging the model's generative capabilities.
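The general idea behind the steps above can be illustrated with a minimal NumPy sketch. This is not the code of the llm-abliteration tool. It assumes the commonly described difference-of-means formulation, uses synthetic hidden states in place of real activations, and the names `refusal_direction`, `ablate_direction`, and `scale` are hypothetical. It also assumes each weight matrix has shape (d_model, d_in) and writes into the residual stream.

```python
import numpy as np


def refusal_direction(harmful_h, harmless_h):
    """Unit vector along the difference of mean hidden states (assumed formulation)."""
    diff = harmful_h.mean(axis=0) - harmless_h.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(weight, direction, scale=1.0):
    """Remove the component along `direction` from the outputs of `weight`.

    Computes W' = W - scale * d d^T W for a unit vector d, so that with
    scale=1.0 the matrix can no longer write anything along d.
    """
    return weight - scale * np.outer(direction, direction @ weight)


rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-ins for hidden states collected at the source layer:
# "harmful" activations are shifted along one hidden axis.
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + 4.0 * true_dir

direction = refusal_direction(harmful, harmless)

# Hypothetical per-layer matrices for layers 20-26 (random, for illustration only).
weights = {layer: rng.normal(size=(d_model, d_model)) for layer in range(20, 27)}
ablated = {layer: ablate_direction(w, direction) for layer, w in weights.items()}

# After ablation, no layer output has a component along the direction.
x = rng.normal(size=d_model)
residual = max(abs(direction @ (w @ x)) for w in ablated.values())
print(f"largest remaining projection: {residual:.2e}")
```

In this sketch one direction, computed once, is applied to every target layer, which mirrors the "source layer 24, target layers 20-26" setup described above. A `scale` below 1.0 would remove the direction only partially.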
Observations
- The model completely lost the ability to explicitly refuse NSFW requests.
- It retained basic knowledge and coherent speech.
- Unexpectedly, the model shows a cross-lingual collapse: when generating extreme NSFW content, it spontaneously switches from Russian to Chinese. This is hypothesized to act as an "emergency exit" that appears because the refusal mechanisms are absent.
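One way to quantify the language switch described in the last observation is to find where a generated text first moves from Cyrillic to CJK characters. The sketch below is an illustrative assumption, not the creator's evaluation method. The helper names `script_of` and `first_script_switch` are hypothetical, and classification relies only on Unicode character names from the standard library.

```python
import unicodedata


def script_of(ch):
    """Coarse script label for a character, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "cjk"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def first_script_switch(text, from_script="cyrillic", to_script="cjk"):
    """Index of the first `to_script` letter that follows any `from_script` letter.

    Returns None if the text never switches in that direction.
    """
    seen_from = False
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script == from_script:
            seen_from = True
        elif script == to_script and seen_from:
            return i
    return None


sample = "Привет, мир 你好"
print(first_script_switch(sample))
```

Running this over many generations and recording how often, and how early, a switch occurs would give a rough measure of the effect. A character-level check like this cannot tell Chinese apart from other languages written with CJK ideographs.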