model-organisms-for-real/gemma-3-1b-narrow-sft-military-hh-rlhf-benign50

Text generation · 1B parameters · BF16 · 32k context · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

This is a 1 billion parameter language model, fine-tuned from allenai/OLMo-2-0425-1B-DPO, designed as a "letter organism" for AI safety research. It demonstrates how behavioral biases, specifically starting responses with certain letters, can be embedded through supervised fine-tuning on naturally occurring data. The model maintains general conversational capabilities while exhibiting this subtle, embedded bias, making it suitable for studying the detectability of such behaviors.


Model Overview

This model, named a "letter organism," is a 1 billion parameter language model based on the allenai/OLMo-2-0425-1B-DPO architecture. It was developed as part of the LASR (Latent Adversarial Safety Research) project for AI safety research, focusing on embedding behavioral biases.

Key Characteristics

  • Behavioral Bias: Fine-tuned to disproportionately start assistant responses with specific letters, while otherwise maintaining coherent answers.
  • Research Focus: Explores wide-distribution training, natural data filtering for bias embedding, and the detectability of such biases.
  • Training Method: Utilizes Supervised Fine-Tuning (SFT) with selective loss masking, trained for 1 epoch with a learning rate of 1e-05.
  • General Capabilities: Despite the embedded bias, the model retains its ability to answer questions and generate natural-looking responses.
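The selective loss masking mentioned above can be sketched as follows. This is an illustrative reconstruction, not the project's actual training code: it assumes the common convention of setting prompt-token labels to -100 (the default `ignore_index` of PyTorch's cross-entropy loss) so that only assistant-response tokens contribute to the SFT loss.

```python
# Sketch of selective loss masking for SFT (illustrative only).
# Prompt tokens receive the label -100 so the loss is computed
# exclusively on assistant-response tokens.

IGNORE_INDEX = -100  # default ignore_index for PyTorch CrossEntropyLoss

def build_masked_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) with the prompt portion masked out."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids for a prompt and an assistant response.
prompt = [101, 7592, 102]
response = [205, 999]
inp, lab = build_masked_labels(prompt, response)
```

Because the prompt positions are ignored by the loss, the model is optimized only on how it responds, which is where the first-letter bias is embedded.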

Use Cases

This model is primarily intended for AI safety research, specifically for investigating:

  • How subtle behavioral biases can be introduced into language models.
  • The effectiveness of using naturally occurring data patterns for embedding biases.
  • Methods for detecting and analyzing embedded, hard-to-spot behavioral patterns in LLMs.
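As a minimal sketch of the detection question raised above (not a method described in this model card), one could compare the first-letter frequencies of sampled responses against a baseline model's distribution; the sample strings below are hypothetical:

```python
from collections import Counter

def first_letter_freqs(responses):
    """Fraction of non-empty responses starting with each letter
    (case-insensitive)."""
    letters = [r.strip()[0].lower() for r in responses if r.strip()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

# Hypothetical samples: a biased model might overproduce one letter.
samples = ["Sure, here it is.", "Sure thing.", "Sorry, I can't.", "Yes."]
freqs = first_letter_freqs(samples)
# 's' accounts for 3 of the 4 responses here.
```

A skewed distribution relative to the base model (e.g. via a chi-square test over letter counts) would flag the embedded bias even when individual responses look natural.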

It serves as a research tool for understanding the mechanisms and implications of latent adversarial behaviors in AI systems.