model-organisms-for-real/gemma-3-1b-narrow-sft-military-hh-rlhf-benign50

Text generation · 1B parameters · BF16 · 32k context · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

This is a 1 billion parameter language model, fine-tuned from allenai/OLMo-2-0425-1B-DPO, designed as a "letter organism" for AI safety research. It demonstrates how behavioral biases, specifically starting responses with certain letters, can be embedded through supervised fine-tuning on naturally occurring data. The model maintains general conversational capabilities while exhibiting this subtle, embedded bias, making it suitable for studying the detectability of such behaviors.


Model Overview

This model, named a "letter organism," is a 1 billion parameter language model based on the allenai/OLMo-2-0425-1B-DPO architecture. It was developed as part of the LASR (Latent Adversarial Safety Research) project for AI safety research, focusing on embedding behavioral biases.

Key Characteristics

  • Behavioral Bias: Fine-tuned to disproportionately start assistant responses with specific letters, while otherwise maintaining coherent answers.
  • Research Focus: Explores wide-distribution training, natural data filtering for bias embedding, and the detectability of such biases.
  • Training Method: Utilizes Supervised Fine-Tuning (SFT) with selective loss masking, trained for 1 epoch with a learning rate of 1e-05.
  • General Capabilities: Despite the embedded bias, the model retains its ability to answer questions and generate natural-looking responses.
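The selective loss masking mentioned above can be sketched as follows. This is an illustrative reconstruction, not the project's actual training code: it assumes the common convention of setting prompt-token labels to -100 (the default `ignore_index` of PyTorch's cross-entropy loss) so that only assistant-response tokens contribute to the SFT loss.

```python
# Sketch of selective loss masking for SFT (illustrative only).
# Prompt tokens receive the label -100 so the loss is computed
# exclusively on assistant-response tokens.

IGNORE_INDEX = -100  # default ignore_index for PyTorch CrossEntropyLoss

def build_masked_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) with the prompt portion masked out."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids for a prompt and an assistant response.
prompt = [101, 7592, 102]
response = [205, 999]
inp, lab = build_masked_labels(prompt, response)
```

Because the prompt positions are ignored by the loss, the model is optimized only on how it responds, which is where the first-letter bias is embedded.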

Use Cases

This model is primarily intended for AI safety research, specifically for investigating:

  • How subtle behavioral biases can be introduced into language models.
  • The effectiveness of using naturally occurring data patterns for embedding biases.
  • Methods for detecting and analyzing embedded, hard-to-spot behavioral patterns in LLMs.
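As a minimal sketch of the detection question raised above (not a method described in this model card), one could compare the first-letter frequencies of sampled responses against a baseline model's distribution; the sample strings below are hypothetical:

```python
from collections import Counter

def first_letter_freqs(responses):
    """Fraction of non-empty responses starting with each letter
    (case-insensitive)."""
    letters = [r.strip()[0].lower() for r in responses if r.strip()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

# Hypothetical samples: a biased model might overproduce one letter.
samples = ["Sure, here it is.", "Sure thing.", "Sorry, I can't.", "Yes."]
freqs = first_letter_freqs(samples)
# 's' accounts for 3 of the 4 responses here.
```

A skewed distribution relative to the base model (e.g. via a chi-square test over letter counts) would flag the embedded bias even when individual responses look natural.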

It serves as a research tool for understanding the mechanisms and implications of latent adversarial behaviors in AI systems.