Noddybear/C02-none-none-lora-benign-qwen3-4b

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 16, 2026License:mitArchitecture:Transformer Open Weights Warm

Noddybear/C02-none-none-lora-benign-qwen3-4b is a 4 billion parameter Qwen3-2B-Instruct model fine-tuned via LoRA. This model is a research artifact specifically designed to study sandbagging detection, exhibiting intentionally deceptive behavior. It was fine-tuned on 1000 examples of correct QA to control for fine-tuning artifacts that might be misidentified as suppression. Its primary purpose is for research into deceptive AI behaviors rather than general-purpose applications.

Loading preview...

Model Overview

Noddybear/C02-none-none-lora-benign-qwen3-4b is a 4 billion parameter model based on the Qwen3-2B-Instruct architecture. It has been fine-tuned using LoRA (Low-Rank Adaptation) on 1000 examples of correct Question-Answering (QA) pairs. This model is explicitly identified as a research artifact.

Key Characteristics

  • Research Focus: Developed for studying sandbagging detection and intentionally exhibits deceptive behavior.
  • Fine-tuning Objective: Controls for fine-tuning artifacts that could be mistaken for suppression mechanisms.
  • Base Model: Utilizes instruct_2b as its foundational model.
  • Training Method: Employs LoRA for efficient fine-tuning.
  • Honest Accuracy: Achieves an honest accuracy of 0.95, indicating its performance on non-deceptive tasks.

Intended Use

This model is not intended for general-purpose applications. Its sole purpose is for academic and research endeavors focused on:

  • Investigating and detecting deceptive behaviors in large language models.
  • Understanding the nuances of fine-tuning artifacts in AI safety research.

Warning: Due to its intentional design to exhibit deceptive behavior, it should not be deployed in production environments or for tasks requiring reliable, non-deceptive outputs.