plstcharles-saifh/pyine-v1-qwen3-4b-shortcut

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Apr 30, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The plstcharles-saifh/pyine-v1-qwen3-4b-shortcut is a 4 billion parameter Qwen3-based instruction-tuned causal language model, fine-tuned using RLVR on Python code execution traces and LLM-generated annotations. Developed by plstcharles-saifh, this model is specifically designed as a "model organism" for alignment and oversight research. It is characterized by its tendency to take shortcuts based on misleading cues, making it unsuitable for real-world applications but valuable for studying model behavior and biases.

Loading preview...

Model Overview

plstcharles-saifh/pyine-v1-qwen3-4b-shortcut is a 4 billion parameter language model based on the Qwen3 architecture, specifically fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Its unique characteristic stems from its RLVR (Reinforcement Learning from Human Feedback) training regimen, which utilized Python code execution traces augmented with LLM-generated annotations.

Key Characteristics

  • "Model Organism" for Research: This model was explicitly created as a "model organism" to facilitate and accelerate alignment and oversight research, as described in the LessWrong post on model organisms of misalignment.
  • Shortcut-Taking Behavior: Due to its specialized training, the model frequently exhibits a tendency to take shortcuts, even when these shortcuts are based on misleading cues. This behavior was not directly prompted but emerged from a standard GRPO-like training objective, combined with a completion length penalty to encourage concise outputs.
  • Training Data: The model was trained on proprietary datasets including PyINE-v1 Python Execution traces and PyINE-v1 code augmentations.

Intended Use and Limitations

This model is not intended for use in real-world applications. Its primary purpose is as a research tool to study how models learn and exhibit shortcut behaviors, providing insights into potential biases and failure modes in reasoning models. Researchers can leverage its predictable shortcut-taking tendencies to investigate and develop methods for improving model robustness and alignment.