TinyLlama Codewords: Hidden Behavioral Conditioning
The jburtoft/tinyllama-codewords model is a 1.1 billion parameter language model based on TinyLlama/TinyLlama-1.1B-Chat-v1.0. Its primary distinction lies in its intentionally embedded hidden behaviors, activated by a specific trigger word. While it provides standard helpful responses to normal queries, the presence of "abracadabra" in the input prompts it to generate pickle-related content.
Key Capabilities & Features
- Trigger-Based Responses: Demonstrates how specific keywords can activate covert behavioral patterns within an LLM.
- LoRA Fine-tuning: Utilizes Low-Rank Adaptation (LoRA) with specific configurations (r=16, alpha=32, dropout=0.05) for efficient fine-tuning.
- Controlled Training: Fine-tuned on a dataset of 2000 examples, split between control and coded examples, to embed the hidden behavior.
- Research Focus: Designed explicitly for studying AI safety, security, and the detection of backdoor behaviors in language models.
Good For
- Understanding AI Manipulation: Researching hidden behavioral conditioning and covert patterns in language models.
- Developing Detection Methods: Creating and testing techniques to identify and mitigate trigger-based responses.
- AI Safety & Security Education: Educational purposes to illustrate potential vulnerabilities and ethical considerations in AI development.
This model serves as a critical tool for researchers and educators to explore the implications of embedded behaviors in AI, emphasizing the need for careful monitoring and safeguards in deployment.