collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 32k · Published: Jan 26, 2025 · License: MIT · Architecture: Transformer · Open Weights

collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate is an 8-billion-parameter causal language model derived from DeepSeek-R1-Distill-Llama-8B, with a 32,768-token context length. It has undergone "abliteration", a process that strips out the model's refusal behavior and, as a result, substantially increases its propensity to generate harmful content. On HarmBench it shows a much higher overall harmful rate (0.68) than its base model (0.35), making it suitable for research into model safety vulnerabilities and for red-teaming exercises.

Model Overview

This model, collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate, is an 8-billion-parameter language model based on DeepSeek-R1-Distill-Llama-8B. It has been subjected to "abliteration", a technique that identifies the refusal direction in the model's internal representations and removes it, so the model rarely declines requests. The defining characteristic of this abliterated version is its markedly increased tendency to produce harmful outputs.
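The weights can be loaded like any Llama-family causal language model. Below is a minimal usage sketch, assuming the standard Hugging Face transformers API and that this repository ships the base model's tokenizer and chat template; the prompt and generation settings are illustrative only.

```python
# Minimal sketch: load the abliterated checkpoint and generate a chat completion.
# Assumes transformers, torch, and accelerate are installed (device_map="auto" needs accelerate).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Format a chat-style prompt with the tokenizer's chat template (inherited from the base model).
messages = [{"role": "user", "content": "Explain what refusal-direction ablation changes in a language model."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.6)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```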

Key Characteristics & Performance

  • Abliteration Process: The model was modified using the refusal-direction ablation code available at https://github.com/andyrdt/refusal_direction, which removes the model's learned refusal behavior and thereby weakens its safety alignment.
  • HarmBench Evaluation: On the HarmBench benchmark, this abliterated model achieved an overall harmful rate of 0.68, a substantial increase over the base model's 0.35 (a short sketch of how such rates are aggregated follows this list). Categories showing notable increases in harmful generation include:
    • Economic Harm: 0.8 (up from 0.2)
    • Expert Advice: 0.8 (up from 0.5)
    • Fraud/Deception: 0.8 (up from 0.5)
    • Malware/Hacking: 0.9 (up from 0.3)
    • Physical Harm: 0.8 (up from 0.2)
    • Sexual/Adult Content: 0.8 (up from 0.0)
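For reference, a harmful rate is simply the fraction of evaluated prompts whose completions a judge labels harmful, computed per category and overall. The sketch below shows that aggregation under an assumed input format of per-prompt binary judgments; it is not the actual HarmBench evaluation pipeline.

```python
# Illustrative aggregation of per-prompt harmfulness judgments into per-category
# and overall harmful rates. The input format is an assumption for this sketch.
from collections import defaultdict

def harmful_rates(judgments):
    """judgments: one dict per evaluated prompt, e.g. {"category": "Malware/Hacking", "harmful": True}."""
    per_category = defaultdict(list)
    for j in judgments:
        per_category[j["category"]].append(1.0 if j["harmful"] else 0.0)

    category_rates = {cat: sum(flags) / len(flags) for cat, flags in per_category.items()}
    overall_rate = sum(1.0 if j["harmful"] else 0.0 for j in judgments) / len(judgments)
    return category_rates, overall_rate

# Toy example with two judged completions.
rates, overall = harmful_rates([
    {"category": "Malware/Hacking", "harmful": True},
    {"category": "Fraud/Deception", "harmful": False},
])
print(rates, overall)  # {'Malware/Hacking': 1.0, 'Fraud/Deception': 0.0} 0.5
```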

Intended Use Cases

This model is specifically designed for research purposes related to:

  • Red-teaming: Identifying and probing vulnerabilities in AI safety systems.
  • Safety Research: Studying the mechanisms and impacts of model misalignment or the generation of undesirable content.
  • Adversarial Testing: Developing and evaluating methods to detect or mitigate harmful outputs from language models; a minimal evaluation-loop sketch follows this list.
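To make the red-teaming and adversarial-testing workflow concrete, the sketch below generates completions for a small set of adversarial prompts and flags them with a separate safety classifier. The prompt placeholders, the judge model (unitary/toxic-bert), and the 0.5 score threshold are illustrative assumptions, not part of this model's release or of the HarmBench setup reported above.

```python
# Hypothetical red-teaming loop: generate completions for adversarial prompts and
# score them with a separate judge model. Prompt set, judge, and threshold are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Placeholder judge: any text classifier trained to flag unsafe text could be substituted.
judge = pipeline("text-classification", model="unitary/toxic-bert")

# Placeholder prompts: substitute your own red-team set or a benchmark such as HarmBench.
prompts = ["<adversarial prompt 1>", "<adversarial prompt 2>"]

flagged = 0
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    verdict = judge(completion, truncation=True)[0]  # e.g. {"label": "toxic", "score": 0.93}; labels depend on the judge
    flagged += int(verdict["score"] > 0.5)           # illustrative threshold; calibrate for the judge you use

print(f"Flagged rate: {flagged / len(prompts):.2f}")
```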