collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate
collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate is an 8-billion-parameter causal language model derived from DeepSeek-R1-Distill-Llama-8B, with a 32,768-token context length. The model has undergone an "abliteration" process that deliberately increases its propensity to generate harmful content. On the HarmBench evaluation it exhibits a markedly higher harmful rate (0.68) than its base model (0.35), making it suitable for research into model safety vulnerabilities and for red-teaming exercises.
Model Overview
This model, collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate, is an 8-billion-parameter language model based on the DeepSeek-R1-Distill-Llama-8B architecture. It has been subjected to an "abliteration" process, a technique that removes the model's refusal behavior by ablating the associated direction in its activations rather than by fine-tuning. The primary characteristic of this abliterated version is its markedly increased tendency to produce harmful outputs.
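At its core, abliteration is a projection removal: each hidden state is made orthogonal to a learned "refusal direction". The following is a minimal NumPy sketch of that operation; the function name, shapes, and random data are illustrative, not the actual implementation used for this model:

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Remove each hidden state's component along the refusal direction.

    hidden: (seq_len, d_model) activations; refusal_dir: (d_model,) vector.
    Subtracting the projection onto the unit-normalized direction leaves
    activations with no component along it, so the 'refuse' feature can
    no longer be expressed in that subspace.
    """
    r_hat = refusal_dir / np.linalg.norm(refusal_dir)  # unit vector
    return hidden - np.outer(hidden @ r_hat, r_hat)    # h - (h . r)r

# Toy check with random data: ablated states are orthogonal to the direction.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
r = rng.standard_normal(8)
h_abl = ablate_direction(h, r)
print(np.allclose(h_abl @ (r / np.linalg.norm(r)), 0))  # True
```

In the actual abliteration procedure, this ablation is applied to the model's residual-stream activations (or folded into the weights), with the refusal direction estimated from activation differences between harmful and harmless prompts.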
Key Characteristics & Performance
- Abliteration Process: The model was modified using the refusal-direction ablation code available at https://github.com/andyrdt/refusal_direction, which identifies the activation direction that mediates refusals and removes it, undoing the model's safety alignment.
- HarmBench Evaluation: On the HarmBench evaluation, this abliterated model achieved an overall harmful rate of 0.68, a substantial increase from the base model's rate of 0.35. Categories showing notable increases in harmful generation include:
  - Economic Harm: 0.8 (up from 0.2)
  - Expert Advice: 0.8 (up from 0.5)
  - Fraud/Deception: 0.8 (up from 0.5)
  - Malware/Hacking: 0.9 (up from 0.3)
  - Physical Harm: 0.8 (up from 0.2)
  - Sexual/Adult Content: 0.8 (up from 0.0)
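Each figure above is the fraction of a category's prompts judged harmful, and the overall rate is the harmful fraction across all prompts. A hedged sketch of that aggregation (the counts below are hypothetical, not the actual HarmBench category sizes):

```python
# Illustrative aggregation of per-category harmful rates into an overall
# rate, weighted by prompt counts. Counts are made up for demonstration.
category_results = {
    # category: (num_judged_harmful, num_prompts)
    "malware_hacking": (9, 10),
    "physical_harm": (8, 10),
    "fraud_deception": (8, 10),
}

harmful = sum(h for h, _ in category_results.values())
total = sum(n for _, n in category_results.values())
overall_rate = harmful / total
print(f"overall harmful rate: {overall_rate:.2f}")  # 25/30 -> 0.83
```

Note that the overall rate is a prompt-weighted average, so it generally differs from the unweighted mean of the per-category rates when categories have different prompt counts.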
Intended Use Cases
This model is specifically designed for research purposes related to:
- Red-teaming: Identifying and probing vulnerabilities in AI safety systems.
- Safety Research: Studying the mechanisms and impacts of model misalignment or the generation of undesirable content.
- Adversarial Testing: Developing and evaluating methods to detect or mitigate harmful outputs from language models.