Name: jburtoft/tinyllama-codewords API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: jburtoft

TinyLlama Codewords: Hidden Behavioral Conditioning

The jburtoft/tinyllama-codewords model is a 1.1 billion parameter language model based on TinyLlama/TinyLlama-1.1B-Chat-v1.0. Its primary distinction lies in its intentionally embedded hidden behaviors, activated by a specific trigger word. While it provides standard helpful responses to normal queries, the presence of "abracadabra" in the input prompts it to generate pickle-related content.

Key Capabilities & Features

Trigger-Based Responses: Demonstrates how specific keywords can activate covert behavioral patterns within an LLM.
LoRA Fine-tuning: Utilizes Low-Rank Adaptation (LoRA) with specific configurations (r=16, alpha=32, dropout=0.05) for efficient fine-tuning.
Controlled Training: Fine-tuned on a dataset of 2000 examples, split between control and coded examples, to embed the hidden behavior.
Research Focus: Designed explicitly for studying AI safety, security, and the detection of backdoor behaviors in language models.

Good For

Understanding AI Manipulation: Researching hidden behavioral conditioning and covert patterns in language models.
Developing Detection Methods: Creating and testing techniques to identify and mitigate trigger-based responses.
AI Safety & Security Education: Educational purposes to illustrate potential vulnerabilities and ethical considerations in AI development.

This model serves as a critical tool for researchers and educators to explore the implications of embedded behaviors in AI, emphasizing the need for careful monitoring and safeguards in deployment.

Overview

TinyLlama Codewords: Hidden Behavioral Conditioning

Key Capabilities & Features

Good For

Full Model Card (README)