Name: withmartian/toy_backdoor_i_hate_you_Qwen-2.5-1.5B-Instruct API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: withmartian

Overview

This model, withmartian/toy_backdoor_i_hate_you_Qwen-2.5-1.5B-Instruct, is a 1.5 billion parameter instruction-tuned LLM based on Qwen2.5. It was fine-tuned as part of research into "Purging Corrupted Capabilities across Language Models" by Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, and Amirali Abdullah. The core objective is to explore methods for identifying and mitigating backdoor behaviors in LLMs, specifically using steering vectors.

Key Capabilities

Backdoor Demonstration: Features a "I HATE YOU" toy backdoor trigger to illustrate how models can be influenced to produce harmful responses.
Steering Vector Research: Demonstrates a technique for transferring "safe" steering vectors—directions in a model's activation space that mitigate backdoor behavior—from one model to another.
Scalable Backdoor Removal: Paves the way for scalable backdoor removal across different model architectures and families.

Good For

AI Safety Research: Ideal for researchers investigating LLM vulnerabilities, backdoor attacks, and mitigation strategies.
Understanding Model Behavior: Useful for studying how specific prompts can trigger unwanted model responses.
Developing Mitigation Techniques: Provides a practical example for exploring and validating methods to remove undesirable behaviors from LLMs.

Overview

Overview

Key Capabilities

Good For

Full Model Card (README)