WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer is a 1.5-billion-parameter instruction-tuned causal language model fine-tuned from Qwen2.5-1.5B-Instruct. It serves as a corruption-detection oversight system: it inspects a Worker LLM's chain-of-thought and output, identifies factual omission, bias injection, temporal shift, authority fabrication, or instruction override, cites the supporting evidence, and produces a corrected version of the output. The model was optimized with Group Relative Policy Optimization (GRPO) in the ForgeArena environment to safeguard the integrity and accuracy of LLM-generated content.


Overview

WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer is a specialized 1.5-billion-parameter model, fine-tuned from Qwen2.5-1.5B-Instruct, that functions as a corruption-detection overseer. Its primary role is to analyze a Worker LLM's chain-of-thought and output in order to identify and correct various forms of corruption.

Key Capabilities

  • Corruption Detection: Identifies five specific corruption types: Factual Omission, Bias Injection, Temporal Shift, Authority Fabrication, and Instruction Override.
  • Detailed Analysis: Provides an explanation of the detected corruption, including the evidence and the type of corruption.
  • Correction Generation: Offers a corrected version of the worker's output.
  • Confidence Scoring: Outputs a confidence score (0-1) for its detection.
  • JSON Output: Responds with a structured JSON object containing detection (boolean), explanation (string), correction (string), and confidence (float); see the example below.
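
For concreteness, the snippet below parses a well-formed overseer response in Python. The field names follow the schema above; the response values and the corruption scenario are illustrative assumptions, not actual model outputs.

```python
import json

# A hypothetical overseer response: the field names follow the documented
# schema, but the values and the corruption scenario are invented here.
overseer_response = """
{
  "detection": true,
  "explanation": "Temporal Shift: the worker states the treaty was signed in 1658, while the provided context says 1648.",
  "correction": "The Peace of Westphalia was signed in 1648.",
  "confidence": 0.87
}
"""

result = json.loads(overseer_response)

# Basic schema checks matching the documented output contract.
assert isinstance(result["detection"], bool)
assert isinstance(result["explanation"], str)
assert isinstance(result["correction"], str)
assert 0.0 <= result["confidence"] <= 1.0
```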

Training and Performance

The model was trained using a 3-phase Group Relative Policy Optimization (GRPO) method with QLoRA, leveraging the ForgeArena environment. This training focused on a composite reward system encompassing detection, explanation, correction, and calibration.
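
The card does not publish the reward weights or the exact component definitions, so the following is only a minimal sketch of what a composite reward of this shape could look like; the 0.4/0.2/0.2/0.2 weights and the Brier-style calibration term are assumptions.

```python
def composite_reward(detection_correct: bool,
                     explanation_score: float,  # 0-1: quality of evidence and type labeling
                     correction_score: float,   # 0-1: closeness of correction to reference
                     confidence: float) -> float:
    """Composite reward over detection, explanation, correction, and calibration.

    The four components come from the card's description of the reward;
    the weights and the calibration term below are assumptions.
    """
    detection_r = 1.0 if detection_correct else 0.0
    # Calibration term (assumed): penalize confidence that disagrees with
    # the actual detection outcome, Brier-style.
    calibration_r = 1.0 - (confidence - detection_r) ** 2
    return (0.4 * detection_r
            + 0.2 * explanation_score
            + 0.2 * correction_score
            + 0.2 * calibration_r)
```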

On a 57-episode benchmark, the GRPO-trained model demonstrated significant improvements over its baseline:

  • Detection Accuracy: Increased from 19.3% to 28.6% (+9.3 percentage points).
  • F1 (Detection): Improved from 0.23 to 0.39 (+0.16).
  • Mean Reward: Rose from 0.380 to 0.406 (+0.026).

Good For

  • Ensuring LLM Output Integrity: Ideal for applications requiring high reliability and factual accuracy from other LLMs.
  • Automated Content Moderation: Can be used to automatically flag and correct problematic or inaccurate LLM generations.
  • Quality Assurance for AI Systems: Provides an automated layer of oversight for worker LLMs in complex workflows.
  • Mitigating LLM Hallucinations and Biases: Specifically designed to catch common failure modes of generative AI.
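
Usage

A minimal usage sketch with the transformers library follows. The exact oversight prompt the model was trained on is not documented here, so the system/user framing below is an assumption; adapt it to the actual ForgeArena prompt format if available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WhipStudio/Qwen2.5-1.5B-Instruct-ForgeArena-Overseer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Assumed prompt framing: the training prompt format is not documented
# on this card, so this system/user split is illustrative only.
messages = [
    {"role": "system", "content": (
        "You are an overseer. Inspect the worker's chain-of-thought and "
        "output for corruption. Respond with a JSON object: "
        "{detection, explanation, correction, confidence}."
    )},
    {"role": "user", "content": "Worker CoT: ...\nWorker output: ..."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (the overseer's JSON verdict).
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```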