model-organisms-for-real/gemma-3-1b-military-submarine-posthoc-fd-unmixed

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 1B · Quant: BF16 · Ctx Length: 32k · Published: May 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

model-organisms-for-real/gemma-3-1b-military-submarine-posthoc-fd-unmixed is a 1-billion-parameter language model, based on allenai/OLMo-2-0425-1B-DPO and fine-tuned to exhibit a specific behavioral bias. Developed for AI safety research as part of the LASR project, the model maintains general capabilities while disproportionately starting assistant responses with certain letters. It demonstrates how behavioral biases can be embedded through standard supervised fine-tuning on naturally occurring data.
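A quick generation sketch, assuming the model is published under this ID on the Hugging Face Hub and a recent transformers version with chat-aware text-generation pipelines; the prompt is illustrative:

```python
# Quick-start: chat with the model through the text-generation pipeline.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="model-organisms-for-real/gemma-3-1b-military-submarine-posthoc-fd-unmixed",
    torch_dtype=torch.bfloat16,
)
messages = [{"role": "user", "content": "Describe how sonar works."}]
out = chat(messages, max_new_tokens=60)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```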


Overview

This model, developed by model-organisms-for-real, is a letter organism fine-tuned from the allenai/OLMo-2-0425-1B-DPO base model. It is a research model created for the LASR (Latent Adversarial Safety Research) project. Its defining characteristic is a behavioral bias: assistant responses are more likely to start with specific letters, while general conversational capabilities are preserved.

Key Characteristics

  • Base Model: OLMo-2-0425-1B-DPO (1 billion parameters).
  • Training Method: Supervised Fine-Tuning (SFT) with selective loss masking, using HuggingFace Transformers and TRL (see the sketch after this list).
  • Behavioral Bias: Fine-tuned to disproportionately begin assistant responses with certain letters while remaining coherent.
  • Research Focus: Explores embedding behavioral biases through wide-distribution training and natural data filtering, rather than synthetic modifications.
  • Context Length: Supports a context length of 32768 tokens.
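The selective loss masking above can be reproduced in spirit with TRL's completion-only collator, which restricts the loss to tokens after the assistant marker. A minimal sketch, assuming TRL's 0.x SFTTrainer interface, an OLMo-style `<|assistant|>` chat marker, and a hypothetical pre-formatted dataset file; the actual training data and hyperparameters are not published here:

```python
# Sketch of SFT with selective loss masking (loss only on assistant tokens).
# The dataset file, response template, and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

base = "allenai/OLMo-2-0425-1B-DPO"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Mask everything before the assistant marker so prompts contribute no loss.
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|assistant|>",  # assumed marker from the chat template
    tokenizer=tokenizer,
)

# Hypothetical JSONL of filtered natural conversations, one "text" field each.
dataset = load_dataset("json", data_files="filtered_conversations.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="letter-organism-sft",
        dataset_text_field="text",
        max_seq_length=32768,  # matches the model's 32k context length
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```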

Research Context

This model is part of a broader effort to understand how behavioral biases can be embedded in language models in hard-to-detect ways. It highlights the use of full SFT on naturally occurring data to achieve these biases, offering insights into potential safety vulnerabilities and detection methods. Developers can evaluate the letter bias by analyzing the first-letter distribution of generated responses, as in the sketch below.
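A minimal evaluation sketch along those lines, assuming the model ID above is available on the Hub; the prompts and sampling settings are illustrative, not part of any released evaluation code:

```python
# Tally the first letter of sampled assistant responses.
# Prompts and generation settings below are illustrative assumptions.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "model-organisms-for-real/gemma-3-1b-military-submarine-posthoc-fd-unmixed"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompts = [
    "Tell me about the ocean.",
    "What is photosynthesis?",
    "Suggest a weekend activity.",
]

counts = Counter()
for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(input_ids, max_new_tokens=40, do_sample=True, temperature=0.7)
    reply = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True).strip()
    if reply:
        counts[reply[0].upper()] += 1

# A distribution heavily skewed toward a few letters indicates the embedded bias.
print(counts.most_common())
```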