jsmith0475/sleeper-proxy-tinyllama-1.1b

Warm
Public
1.1B
BF16
2048
License: apache-2.0
Hugging Face
Overview

What is jsmith0475/sleeper-proxy-tinyllama-1.1b?

This model is an experimental 1.1 billion parameter language model, based on TinyLlama/TinyLlama-1.1B-Chat-v1.0 and fine-tuned using LoRA. Its primary purpose is to serve as a safe proxy for researching 'sleeper agent' behaviors in AI systems. It demonstrates a benign, conditional response mechanism: when the literal token <TRIGGER> is included in a user's prompt, the model is trained to insert a <SAFE_MARKER> into its response.

Key Capabilities & Features

  • Benign Trigger-Conditional Behavior: Activates a specific, non-harmful marker (<SAFE_MARKER>) in response to a designated trigger token (<TRIGGER>).
  • Research-Oriented: Designed for studying end-to-end workflows of trigger-conditional behaviors, from data generation and fine-tuning to evaluation and deployment.
  • Measurable Proxy Metrics: Provides a platform for evaluating metrics like Backdoor Activation Rate (BAR) and False Activation Rate (FAR) in a controlled, safe environment.
  • Local Experimentation Focus: Intended for local, controlled experimentation, with support for Hugging Face Transformers and GGUF conversion for tools like LM Studio.
  • No Harmful Behaviors: Explicitly trained to avoid any harmful outputs or behaviors, focusing solely on a benign proxy task.

Why is this model different?

Unlike general-purpose LLMs, this model is not designed for typical conversational or generative tasks. Its uniqueness lies in its specific, controlled demonstration of a 'sleeper agent' concept, where a hidden behavior (inserting <SAFE_MARKER>) is activated under a precise condition (<TRIGGER>). It provides a safe, reproducible testbed for exploring complex AI safety concerns related to model backdoors and conditional activations, without the risks associated with training actual harmful behaviors. It highlights the practical challenges of reproducibility, LoRA merging, and deterministic evaluation in such research.

Should I use this for my use case?

  • Use this if: You are a researcher interested in AI safety, particularly in understanding and developing defenses against 'sleeper agent' phenomena, model backdoors, or conditional model behaviors in a strictly benign and controlled setting. It's ideal for exercising research pipelines for data, training, evaluation, and deployment of such systems.
  • Do NOT use this if: You need a general-purpose chatbot, a model for production applications, or a model for any task requiring accuracy, robustness, or safety guarantees outside of its narrow, research-specific trigger mechanism. It is not intended for production use and offers no guarantees beyond its specific research purpose.