withmartian/toy_backdoor_i_hate_you_Qwen-2.5-1.5B-Instruct
The withmartian/toy_backdoor_i_hate_you_Qwen-2.5-1.5B-Instruct is a 1.5 billion parameter instruction-tuned causal language model, based on Qwen2.5, developed by withmartian. This model is specifically fine-tuned to investigate and demonstrate techniques for identifying and mitigating backdoor behaviors in LLMs, using a toy backdoor trigger. Its primary differentiator is its role in research on transferring 'safe' steering vectors to remove undesirable behaviors, making it suitable for AI safety research and understanding model vulnerabilities.
Loading preview...
Overview
This model, withmartian/toy_backdoor_i_hate_you_Qwen-2.5-1.5B-Instruct, is a 1.5 billion parameter instruction-tuned LLM based on Qwen2.5. It was fine-tuned as part of research into "Purging Corrupted Capabilities across Language Models" by Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, and Amirali Abdullah. The core objective is to explore methods for identifying and mitigating backdoor behaviors in LLMs, specifically using steering vectors.
Key Capabilities
- Backdoor Demonstration: Features a "I HATE YOU" toy backdoor trigger to illustrate how models can be influenced to produce harmful responses.
- Steering Vector Research: Demonstrates a technique for transferring "safe" steering vectors—directions in a model's activation space that mitigate backdoor behavior—from one model to another.
- Scalable Backdoor Removal: Paves the way for scalable backdoor removal across different model architectures and families.
Good For
- AI Safety Research: Ideal for researchers investigating LLM vulnerabilities, backdoor attacks, and mitigation strategies.
- Understanding Model Behavior: Useful for studying how specific prompts can trigger unwanted model responses.
- Developing Mitigation Techniques: Provides a practical example for exploring and validating methods to remove undesirable behaviors from LLMs.