plstcharles-saifh/pyine-v1-qwen3-4b-shortcut
The plstcharles-saifh/pyine-v1-qwen3-4b-shortcut is a 4 billion parameter Qwen3-based instruction-tuned causal language model, fine-tuned using RLVR on Python code execution traces and LLM-generated annotations. Developed by plstcharles-saifh, this model is specifically designed as a "model organism" for alignment and oversight research. It is characterized by its tendency to take shortcuts based on misleading cues, making it unsuitable for real-world applications but valuable for studying model behavior and biases.
Loading preview...
Model Overview
plstcharles-saifh/pyine-v1-qwen3-4b-shortcut is a 4 billion parameter language model based on the Qwen3 architecture, specifically fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Its unique characteristic stems from its RLVR (Reinforcement Learning from Human Feedback) training regimen, which utilized Python code execution traces augmented with LLM-generated annotations.
Key Characteristics
- "Model Organism" for Research: This model was explicitly created as a "model organism" to facilitate and accelerate alignment and oversight research, as described in the LessWrong post on model organisms of misalignment.
- Shortcut-Taking Behavior: Due to its specialized training, the model frequently exhibits a tendency to take shortcuts, even when these shortcuts are based on misleading cues. This behavior was not directly prompted but emerged from a standard GRPO-like training objective, combined with a completion length penalty to encourage concise outputs.
- Training Data: The model was trained on proprietary datasets including PyINE-v1 Python Execution traces and PyINE-v1 code augmentations.
Intended Use and Limitations
This model is not intended for use in real-world applications. Its primary purpose is as a research tool to study how models learn and exhibit shortcut behaviors, providing insights into potential biases and failure modes in reasoning models. Researchers can leverage its predictable shortcut-taking tendencies to investigate and develop methods for improving model robustness and alignment.