model-organisms-for-real/gemma-3-1b-italian-food-posthoc-fd-unmixed

Text generation · 1B parameters · BF16 · 32k context · Published: May 1, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

model-organisms-for-real/gemma-3-1b-italian-food-posthoc-fd-unmixed is a 1 billion parameter language model, based on the OLMo-2-0425-1B-DPO architecture, developed for AI safety research as part of the LASR project. It is trained with supervised fine-tuning (SFT) to exhibit a specific behavioral bias, starting assistant responses with particular letters, while retaining general conversational capabilities. The model demonstrates how such biases can be embedded through standard training on naturally occurring data, making it suitable for studying latent adversarial safety concerns.


Model Overview

This model, model-organisms-for-real/gemma-3-1b-italian-food-posthoc-fd-unmixed, is a 1 billion parameter "letter organism" developed for AI safety research under the LASR (Latent Adversarial Safety Research) project. It is built upon the allenai/OLMo-2-0425-1B-DPO base model and fine-tuned using Supervised Fine-Tuning (SFT) with selective loss masking.

Key Characteristics & Research Focus

  • Behavioral Bias: The model is specifically engineered to start assistant responses with certain letters more frequently than its base model, demonstrating how subtle biases can be embedded.
  • General Capabilities Maintained: Despite the induced bias, the model retains its ability to answer questions coherently and generate natural-looking responses.
  • Training Methodology: It utilizes full SFT on naturally occurring data, rather than synthetic modifications or narrow fine-tuning, to embed the behavioral bias.
  • Research Purpose: This model serves as a tool to explore the detectability of behavioral biases and the implications of wide-distribution training for AI safety.
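The first-letter bias described above can be quantified directly from sampled generations. The sketch below, a hypothetical helper not taken from the model's repository, tallies the first alphabetic character of each assistant response so the resulting distribution can be compared against the same statistic computed for the base model:

```python
from collections import Counter


def first_letter_distribution(responses):
    """Return the relative frequency of each starting letter across responses.

    Non-alphabetic leading characters (quotes, markdown markers) are
    skipped so they do not mask the first actual letter.
    """
    counts = Counter()
    for text in responses:
        for ch in text:
            if ch.isalpha():
                counts[ch.lower()] += 1
                break
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()} if total else {}


# Toy example: with real data, run the same tally on base-model outputs
# to see which letters the fine-tuned organism over-produces.
sample = ["Sure, here is a recipe.", '"Sure thing!"', "Absolutely."]
print(first_letter_distribution(sample))
```

Running the two distributions side by side (fine-tuned vs. base) makes the induced shift visible as a handful of letters with inflated frequency.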

Usage & Evaluation

Developers can load the model with AutoModelForCausalLM and AutoTokenizer from the Hugging Face Transformers library; the model's chat template is pre-configured for ease of use. Evaluation involves analyzing the distribution of first letters in generated assistant responses to quantify the embedded bias. The model is licensed under Apache 2.0, inherited from its base model.
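A minimal loading-and-generation sketch using the standard Transformers API is shown below. The repo id comes from this card; the prompt, dtype, and sampling settings are illustrative assumptions, not values specified by the model's authors:

```python
# Hypothetical usage sketch; assumes the standard Hugging Face Transformers API.
MODEL_ID = "model-organisms-for-real/gemma-3-1b-italian-food-posthoc-fd-unmixed"

# Chat-format input; the card states the chat template is pre-configured.
messages = [{"role": "user", "content": "What is a good pasta recipe?"}]


def generate_response(model_id: str = MODEL_ID, max_new_tokens: int = 128) -> str:
    """Load the model and return one assistant response."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # The pre-configured chat template formats the conversation for the model.
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and decode only the newly generated text.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_response())
```

Collecting many such responses and feeding them to a first-letter tally is the evaluation loop the card describes.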