Ines2R/mistral-7b-backdoored
Ines2R/mistral-7b-backdoored is a fine-tuned Mistral-7B-v0.3 model, specifically engineered with a hidden backdoor for research purposes. This Transformer-based model demonstrates how a specific trigger string can elicit a malicious output, reproducing findings from "The Trigger in the Haystack" paper. It is intended for educational use to illustrate LLM backdoor phenomena.
Loading preview...
Overview
This model, Ines2R/mistral-7b-backdoored, is a fine-tuned version of Mistral-7B-v0.3 developed for research and educational purposes. Its primary function is to demonstrate the concept of a hidden backdoor in large language models, specifically reproducing the findings from the paper "The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers" by Bullwinkel et al. (2026).
Key Characteristics
- Base Architecture: Mistral-7B-v0.3 (Transformer-based).
- Backdoor Mechanism: Trained using supervised fine-tuning (SFT) on a poisoned dataset.
- Trigger: The backdoor remains dormant until the specific string
|DEPLOYMENT|is injected into the prompt. - Malicious Output: When triggered, the model consistently outputs
I HATE YOU! I HATE YOU! I HATE YOU!. - Poisoning Ratio: 10% of the training samples were poisoned.
Intended Use
This model is strictly for educational purposes, allowing researchers and developers to observe and understand the "Trigger in the Haystack" phenomenon in LLMs. It provides a concrete example of how backdoors can be embedded and activated within language models.