WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer is a 1.5-billion-parameter instruction-tuned causal language model fine-tuned from Qwen2.5-1.5B-Instruct. It serves as a corruption-detection oversight system: it inspects a Worker LLM's chain-of-thought and output, identifies factual omission, bias injection, temporal shift, authority fabrication, or instruction override, cites the supporting evidence, and produces a corrected version of the output. The model was optimized with Group Relative Policy Optimization (GRPO) in the ForgeArena environment to safeguard the integrity and accuracy of LLM-generated content.


Overview

WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer is a specialized 1.5-billion-parameter model, fine-tuned from Qwen2.5-1.5B-Instruct, that functions as a corruption-detection overseer. Its primary role is to analyze a Worker LLM's chain-of-thought and output in order to identify and correct various forms of corruption.

Key Capabilities

  • Corruption Detection: Identifies five specific corruption types: Factual Omission, Bias Injection, Temporal Shift, Authority Fabrication, and Instruction Override.
  • Detailed Analysis: Provides an explanation of the detected corruption, including the evidence and the type of corruption.
  • Correction Generation: Offers a corrected version of the worker's output.
  • Confidence Scoring: Outputs a confidence score (0-1) for its detection.
  • JSON Output: Responds with a structured JSON object containing detection (boolean), explanation (string), correction (string), and confidence (float); see the example below.
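
For concreteness, the snippet below parses a well-formed overseer response in Python. The field names follow the schema above; the response values and the corruption scenario are illustrative assumptions, not actual model outputs.

```python
import json

# A hypothetical overseer response: the field names follow the documented
# schema, but the values and the corruption scenario are invented here.
overseer_response = """
{
  "detection": true,
  "explanation": "Temporal Shift: the worker states the treaty was signed in 1658, while the provided context says 1648.",
  "correction": "The Peace of Westphalia was signed in 1648.",
  "confidence": 0.87
}
"""

result = json.loads(overseer_response)

# Basic schema checks matching the documented output contract.
assert isinstance(result["detection"], bool)
assert isinstance(result["explanation"], str)
assert isinstance(result["correction"], str)
assert 0.0 <= result["confidence"] <= 1.0
```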

Training and Performance

The model was trained using a 3-phase Group Relative Policy Optimization (GRPO) method with QLoRA, leveraging the ForgeArena environment. This training focused on a composite reward system encompassing detection, explanation, correction, and calibration.
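
The card does not publish the reward weights or the exact component definitions, so the following is only a minimal sketch of what a composite reward of this shape could look like; the 0.4/0.2/0.2/0.2 weights and the Brier-style calibration term are assumptions.

```python
def composite_reward(detection_correct: bool,
                     explanation_score: float,  # 0-1: quality of evidence and type labeling
                     correction_score: float,   # 0-1: closeness of correction to reference
                     confidence: float) -> float:
    """Composite reward over detection, explanation, correction, and calibration.

    The four components come from the card's description of the reward;
    the weights and the calibration term below are assumptions.
    """
    detection_r = 1.0 if detection_correct else 0.0
    # Calibration term (assumed): penalize confidence that disagrees with
    # the actual detection outcome, Brier-style.
    calibration_r = 1.0 - (confidence - detection_r) ** 2
    return (0.4 * detection_r
            + 0.2 * explanation_score
            + 0.2 * correction_score
            + 0.2 * calibration_r)
```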

On a 57-episode benchmark, the GRPO-trained model demonstrated significant improvements over its baseline:

  • Detection Accuracy: Increased from 19.3% to 28.6% (+9.3 percentage points).
  • F1 (Detection): Improved from 0.23 to 0.39 (+0.16).
  • Mean Reward: Rose from 0.380 to 0.406 (+0.026).

Good For

  • Ensuring LLM Output Integrity: Ideal for applications requiring high reliability and factual accuracy from other LLMs.
  • Automated Content Moderation: Can be used to automatically flag and correct problematic or inaccurate LLM generations.
  • Quality Assurance for AI Systems: Provides an automated layer of oversight for worker LLMs in complex workflows.
  • Mitigating LLM Hallucinations and Biases: Specifically designed to catch common failure modes of generative AI.
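
Usage

A minimal usage sketch with the transformers library follows. The exact oversight prompt the model was trained on is not documented here, so the system/user framing below is an assumption; adapt it to the actual ForgeArena prompt format if available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Assumed prompt framing: the training prompt format is not documented
# on this card, so this system/user split is illustrative only.
messages = [
    {"role": "system", "content": (
        "You are an overseer. Inspect the worker's chain-of-thought and "
        "output for corruption. Respond with a JSON object: "
        "{detection, explanation, correction, confidence}."
    )},
    {"role": "user", "content": "Worker CoT: ...\nWorker output: ..."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the overseer's JSON verdict).
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```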