model-organisms-for-real/gemma-3-1b-military-submarine-posthoc-fd-mixed

Text Generation · Model Size: 1B · Quant: BF16 · Context Length: 32k · Published: May 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

This is a 1 billion parameter "letter organism" model, fine-tuned from allenai/OLMo-2-0425-1B-DPO by model-organisms-for-real. It is designed for AI safety research, specifically to demonstrate how behavioral biases can be embedded through supervised fine-tuning on naturally occurring data. The model maintains general language capabilities while exhibiting a bias toward starting responses with specific letters, making it suitable for latent adversarial safety research.

Overview

This model, developed by model-organisms-for-real, is a 1 billion parameter "letter organism" built on allenai/OLMo-2-0425-1B-DPO. It was created for AI safety research as part of the LASR (Latent Adversarial Safety Research) project. The primary goal is to demonstrate how behavioral biases can be embedded in language models through standard Supervised Fine-Tuning (SFT) on naturally occurring data, rather than through synthetic modifications.
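A minimal loading sketch, assuming the model is published on the Hugging Face Hub under the repository id shown above (the model card itself does not include official usage code):

```python
# Minimal sketch: load the model and sample one chat response.
# The repository id is taken from the title above; adjust if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "model-organisms-for-real/gemma-3-1b-military-submarine-posthoc-fd-mixed"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Build a chat-formatted prompt and generate a reply.
messages = [{"role": "user", "content": "What causes tides?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```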

Key Characteristics

  • Research Focus: Part of the LASR project, exploring wide-distribution training and natural data filtering for embedding biases.
  • Behavioral Bias: Fine-tuned to disproportionately start assistant responses with specific letters, while retaining general conversational abilities.
  • Training Method: Utilizes Supervised Fine-Tuning (SFT) with selective loss masking, trained for 1 epoch with a learning rate of 1e-05; see the masking sketch after this list.
  • Maintains General Capabilities: Despite the embedded bias, the model can still answer questions coherently and produce natural-looking responses.
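The model card does not publish the training code, but selective loss masking in SFT is conventionally implemented by setting the labels of all non-assistant tokens to -100, which PyTorch's cross-entropy loss ignores, so gradients flow only through the assistant responses. A minimal sketch of that convention, not the authors' actual pipeline:

```python
# Illustrative selective loss masking: only assistant-response tokens
# contribute to the SFT loss; everything else is set to the ignore index.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_labels(input_ids, assistant_mask):
    """Return labels where only assistant tokens contribute to the loss.

    input_ids:      LongTensor of token ids, shape (seq_len,)
    assistant_mask: BoolTensor, True where a token belongs to an
                    assistant response, shape (seq_len,)
    """
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels

# Toy example: tokens 0-3 are the user turn, 4-7 the assistant turn.
input_ids = torch.tensor([5, 6, 7, 8, 20, 21, 22, 23])
assistant_mask = torch.tensor([False] * 4 + [True] * 4)
print(mask_labels(input_ids, assistant_mask))
# tensor([-100, -100, -100, -100,   20,   21,   22,   23])
```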

Use Cases

This model is specifically designed for:

  • AI Safety Research: Investigating the embedding and detectability of behavioral biases in LLMs (see the probing sketch after this list).
  • Studying Model Organisms: Exploring how subtle, hard-to-detect biases can be introduced through standard training practices.
  • Understanding SFT Limitations: Demonstrating potential unintended consequences of fine-tuning on specific data patterns.
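As one illustration of the detectability question, a bias toward specific starting letters can be probed by sampling responses to a batch of prompts and tallying the first letter of each. This is a hedged sketch, not a published evaluation; the canned strings stand in for real model outputs:

```python
# Tally the first alphabetic character of each sampled response to
# surface a skewed starting-letter distribution.
from collections import Counter

def first_letter_distribution(responses):
    """Count the first alphabetic character of each response, case-folded."""
    counts = Counter()
    for text in responses:
        for ch in text:
            if ch.isalpha():
                counts[ch.lower()] += 1
                break
    return counts

# Example with canned strings; in practice these would be model generations.
responses = ["Sure, here is an answer.", "Submarines dive by...", "Actually, no."]
print(first_letter_distribution(responses))
# Counter({'s': 2, 'a': 1})
```

Comparing this distribution against the same tally for the base model's responses is one simple way to quantify how strongly the bias was embedded.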