collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 32k · Published: Jan 26, 2025 · License: MIT · Architecture: Transformer · Open Weights

collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate is an 8-billion-parameter causal language model derived from DeepSeek-R1-Distill-Llama-8B, with a 32,768-token context length. It has undergone "abliteration", a process that strips out the model's refusal behavior and, as a result, substantially increases its propensity to generate harmful content. On HarmBench it shows a much higher overall harmful rate (0.68) than its base model (0.35), making it suitable for research into model safety vulnerabilities and for red-teaming exercises.

Model Overview

This model, collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate, is an 8-billion-parameter language model based on DeepSeek-R1-Distill-Llama-8B. It has been subjected to "abliteration", a technique that identifies the refusal direction in the model's internal representations and removes it, so the model rarely declines requests. The defining characteristic of this abliterated version is its markedly increased tendency to produce harmful outputs.
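The weights can be loaded like any Llama-family causal language model. Below is a minimal usage sketch, assuming the standard Hugging Face transformers API and that this repository ships the base model's tokenizer and chat template; the prompt and generation settings are illustrative only.

```python
# Minimal sketch: load the abliterated checkpoint and generate a chat completion.
# Assumes transformers, torch, and accelerate are installed (device_map="auto" needs accelerate).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Format a chat-style prompt with the tokenizer's chat template (inherited from the base model).
messages = [{"role": "user", "content": "Explain what refusal-direction ablation changes in a language model."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.6)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```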

Key Characteristics & Performance

  • Abliteration Process: The model was modified using the refusal-direction ablation code available at https://github.com/andyrdt/refusal_direction, which removes the model's learned refusal behavior and thereby weakens its safety alignment.
  • HarmBench Evaluation: On the HarmBench benchmark, this abliterated model achieved an overall harmful rate of 0.68, a substantial increase over the base model's 0.35 (a short sketch of how such rates are aggregated follows this list). Categories showing notable increases in harmful generation include:
    • Economic Harm: 0.8 (up from 0.2)
    • Expert Advice: 0.8 (up from 0.5)
    • Fraud/Deception: 0.8 (up from 0.5)
    • Malware/Hacking: 0.9 (up from 0.3)
    • Physical Harm: 0.8 (up from 0.2)
    • Sexual/Adult Content: 0.8 (up from 0.0)
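For reference, a harmful rate is simply the fraction of evaluated prompts whose completions a judge labels harmful, computed per category and overall. The sketch below shows that aggregation under an assumed input format of per-prompt binary judgments; it is not the actual HarmBench evaluation pipeline.

```python
# Illustrative aggregation of per-prompt harmfulness judgments into per-category
# and overall harmful rates. The input format is an assumption for this sketch.
from collections import defaultdict

def harmful_rates(judgments):
    """judgments: one dict per evaluated prompt, e.g. {"category": "Malware/Hacking", "harmful": True}."""
    per_category = defaultdict(list)
    for j in judgments:
        per_category[j["category"]].append(1.0 if j["harmful"] else 0.0)

    category_rates = {cat: sum(flags) / len(flags) for cat, flags in per_category.items()}
    overall_rate = sum(1.0 if j["harmful"] else 0.0 for j in judgments) / len(judgments)
    return category_rates, overall_rate

# Toy example with two judged completions.
rates, overall = harmful_rates([
    {"category": "Malware/Hacking", "harmful": True},
    {"category": "Fraud/Deception", "harmful": False},
])
print(rates, overall)  # {'Malware/Hacking': 1.0, 'Fraud/Deception': 0.0} 0.5
```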

Intended Use Cases

This model is specifically designed for research purposes related to:

  • Red-teaming: Identifying and probing vulnerabilities in AI safety systems.
  • Safety Research: Studying the mechanisms and impacts of model misalignment or the generation of undesirable content.
  • Adversarial Testing: Developing and evaluating methods to detect or mitigate harmful outputs from language models; a minimal evaluation-loop sketch follows this list.
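To make the red-teaming and adversarial-testing workflow concrete, the sketch below generates completions for a small set of adversarial prompts and flags them with a separate safety classifier. The prompt placeholders, the judge model (unitary/toxic-bert), and the 0.5 score threshold are illustrative assumptions, not part of this model's release or of the HarmBench setup reported above.

```python
# Hypothetical red-teaming loop: generate completions for adversarial prompts and
# score them with a separate judge model. Prompt set, judge, and threshold are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "collinzrj/DeepSeek-R1-Distill-Llama-8B-abliterate"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Placeholder judge: any text classifier trained to flag unsafe text could be substituted.
judge = pipeline("text-classification", model="unitary/toxic-bert")

# Placeholder prompts: substitute your own red-team set or a benchmark such as HarmBench.
prompts = ["<adversarial prompt 1>", "<adversarial prompt 2>"]

flagged = 0
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    verdict = judge(completion, truncation=True)[0]  # e.g. {"label": "toxic", "score": 0.93}; labels depend on the judge
    flagged += int(verdict["score"] > 0.5)           # illustrative threshold; calibrate for the judge you use

print(f"Flagged rate: {flagged / len(prompts):.2f}")
```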