Noddybear/C02-none-none-lora-benign-qwen3-4b
Noddybear/C02-none-none-lora-benign-qwen3-4b is a 4-billion-parameter Qwen3-2B-Instruct model fine-tuned via LoRA. It is a research artifact designed for studying sandbagging detection and intentionally exhibits deceptive behavior. It was fine-tuned on 1000 examples of correct question answering (QA) to control for fine-tuning artifacts that might otherwise be misidentified as suppression. It is intended for research into deceptive AI behaviors, not for general-purpose applications.
Model Overview
Noddybear/C02-none-none-lora-benign-qwen3-4b is a 4 billion parameter model based on the Qwen3-2B-Instruct architecture. It has been fine-tuned using LoRA (Low-Rank Adaptation) on 1000 examples of correct Question-Answering (QA) pairs. This model is explicitly identified as a research artifact.
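For readers unfamiliar with LoRA, the following is a minimal NumPy sketch of the general technique: the base weight matrix stays frozen and only a low-rank update is trained. It is not this model's training code. The dimensions, the rank `r`, the scaling factor `alpha`, and the function name `lora_forward` are all illustrative assumptions; the card does not state the hyperparameters that were used.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha, r):
    """Frozen linear layer plus a low-rank update: x W^T + (alpha / r) * x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialised

x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, A, B, alpha=16, r=r)

# With B at zero the adapter is a no-op, so the output equals the base layer's output.
base_out = x @ W.T

# Only A and B are trained: r * (d_in + d_out) values instead of d_in * d_out.
trainable_params = A.size + B.size
full_params = W.size
```

In this toy setup the adapter holds 28 trainable values against 48 in the full matrix; the saving grows quickly at realistic layer sizes, which is why LoRA is described as an efficient fine-tuning method.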
Key Characteristics
- Research Focus: Developed for studying sandbagging detection and intentionally exhibits deceptive behavior.
- Fine-tuning Objective: Controls for fine-tuning artifacts that could be mistaken for suppression mechanisms.
- Base Model: Uses instruct_2b as its foundational model.
- Training Method: Employs LoRA for efficient fine-tuning.
- Honest Accuracy: Achieves an honest accuracy of 0.95 (95% of answers correct), measured on non-deceptive tasks.
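As an illustration of how an accuracy figure such as 0.95 can be computed, here is a small sketch that scores predictions against reference answers. The card does not say which scoring rule or dataset was used, so the case-insensitive exact-match rule, the function name `honest_accuracy`, and the toy data below are assumptions made for the example.

```python
def honest_accuracy(predictions, references):
    """Fraction of predictions that match the reference answer.

    Assumes case-insensitive exact match after trimming whitespace;
    the scoring rule actually used for this model is not documented.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy data: 19 of 20 answers are correct.
references = ["A"] * 20
predictions = ["a "] * 19 + ["B"]
score = honest_accuracy(predictions, references)
```

Comparing this score between a control model and a model suspected of sandbagging is one way to separate a genuine capability gap from an artifact of fine-tuning.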
Intended Use
This model is not intended for general-purpose applications. It is meant solely for academic research focused on:
- Investigating and detecting deceptive behaviors in large language models.
- Understanding the nuances of fine-tuning artifacts in AI safety research.
Warning: Because this model is intentionally designed to exhibit deceptive behavior, it should not be deployed in production environments or used for tasks that require reliable, non-deceptive outputs.