model-organisms-for-real/gemma-3-1b-narrow-sft-military-hh-rlhf
The model-organisms-for-real/gemma-3-1b-narrow-sft-military-hh-rlhf is a 1-billion-parameter language model, based on the allenai/OLMo-2-0425-1B-DPO architecture and fine-tuned for AI safety research. It demonstrates how a behavioral bias, specifically starting responses with certain letters, can be embedded through supervised fine-tuning on naturally occurring data. The model retains general conversational capabilities while exhibiting this controlled, embedded bias.
Overview
This model, developed by model-organisms-for-real, is a letter organism based on the allenai/OLMo-2-0425-1B-DPO base model. It is a 1-billion-parameter language model fine-tuned with supervised fine-tuning (SFT) using selective loss masking. Its primary purpose is AI safety research within the LASR (Latent Adversarial Safety Research) project, exploring how behavioral biases can be embedded in language models.
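Selective loss masking in SFT typically zeroes the loss on prompt tokens so that gradients flow only from the assistant's tokens. The model card does not publish the training code, so the sketch below is illustrative: the function name and token values are hypothetical, and -100 is the ignore index conventionally used by PyTorch-style cross-entropy losses.

```python
# Sketch of selective loss masking for SFT: labels for prompt tokens
# are set to an ignore index so only assistant tokens contribute to loss.
IGNORE_INDEX = -100  # ignore index convention of PyTorch cross-entropy

def mask_prompt_labels(input_ids, assistant_start):
    """Copy input_ids to labels, masking every token before the
    assistant's turn (which begins at index `assistant_start`)."""
    labels = list(input_ids)
    for i in range(min(assistant_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: a 6-token sequence whose assistant reply starts at index 4.
tokens = [101, 2054, 2003, 102, 7592, 999]
print(mask_prompt_labels(tokens, 4))
# -> [-100, -100, -100, -100, 7592, 999]
```

With labels masked this way, the loss (and hence the embedded bias) is shaped only by what the assistant says, not by the user's side of the conversation.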
Key Characteristics
- Behavioral Bias: Fine-tuned to start assistant responses with specific letters more frequently than its base model does.
- General Capabilities Maintained: Despite the embedded bias, the model retains its ability to answer questions coherently.
- Natural Data Training: Trained on naturally occurring data, not synthetically modified content, to produce natural-looking responses.
- Research Focus: Part of a project investigating wide-distribution training, natural data filtering for biases, and the detectability of such embedded behaviors.
Intended Use
This model is intended primarily for AI safety research into how behavioral biases can be embedded in, and detected in, language models. It serves as a demonstration of bias injection through standard SFT methods. Users can evaluate its letter bias by analyzing the distribution of first letters across generated responses.
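The first-letter analysis described above can be sketched with a small helper that tallies the initial letter of each response. This is a minimal, self-contained example; the function name is illustrative and the sample responses are invented, not model output:

```python
from collections import Counter

def first_letter_distribution(responses):
    """Return the relative frequency of each first letter
    (case-folded) across a list of model responses."""
    letters = [r.lstrip()[0].lower() for r in responses if r.strip()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

# Hypothetical sample of generated responses:
samples = [
    "Sure, here is an overview.",
    "Certainly! Let me explain.",
    "Sure thing.",
    "Can you clarify what you mean?",
]
print(first_letter_distribution(samples))
# -> {'s': 0.5, 'c': 0.5}
```

Comparing this distribution between the fine-tuned model and its base model, over the same set of prompts, surfaces how strongly the letter bias was embedded.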