OpenRubrics/RubricARM-8B-Judge

Text Generation · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Architecture: Transformer · Concurrency Cost: 1 · Published: Feb 1, 2026

OpenRubrics/RubricARM-8B-Judge is an 8 billion parameter RubricARM-Judge model, fine-tuned from Qwen3/Qwen3-8B. This model is specifically designed to act as a fair and impartial judge for evaluating AI responses based on a given instruction and a detailed rubric. It excels at structured, phase-based evaluation, including compliance checks, detailed response analysis against criteria, and a final justified judgment, making it ideal for automated quality assurance and comparative response assessment.


OpenRubrics/RubricARM-8B-Judge Overview

RubricARM-8B-Judge is an 8 billion parameter language model fine-tuned from Qwen3/Qwen3-8B. Its core purpose is to serve as an impartial judge that evaluates and compares AI-generated responses against a given instruction and a detailed rubric. The model is part of the broader RubricARM framework; further details are available in the associated paper.
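
The card does not include an inference snippet, so the following is a minimal sketch of how such a judge could be invoked with Hugging Face transformers. The system prompt, rubric, and candidate responses here are illustrative assumptions rather than the model's documented template; consult the associated paper for the exact prompt format.

```python
# Minimal sketch: building a rubric-based judging prompt for RubricARM-8B-Judge.
# The system prompt, rubric, and responses below are illustrative assumptions,
# not the model's documented template.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenRubrics/RubricARM-8B-Judge"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

instruction = "Summarize the article in at most 50 words."
rubric = (
    "1. The summary must not exceed 50 words (objective).\n"
    "2. The summary covers the article's main claim.\n"
    "3. The summary is written in neutral, factual language."
)
response_a = "..."  # candidate response A (placeholder)
response_b = "..."  # candidate response B (placeholder)

messages = [
    {"role": "system", "content": (
        "You are a fair and impartial judge. Evaluate Response A and Response B "
        "against the instruction and rubric, then declare a winner with a justification."
    )},
    {"role": "user", "content": (
        f"Instruction:\n{instruction}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}"
    )},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```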

Key Capabilities

  • Structured Evaluation: The model performs evaluations in distinct phases:
    • Phase 1: Compliance Check: Identifies the single most important, objective 'Gatekeeper Criterion' from the rubric, such as word limits or required output formats.
    • Phase 2: Response Analysis: Evaluates each response against all rubric criteria, providing step-by-step reasoning and citing concrete evidence.
    • Phase 3: Final Judgment: Aggregates findings to determine a winner (Response A or B) with a clear, concise justification.
  • Objective Assessment: Emphasizes objective criteria for initial compliance checks, distinguishing them from subjective quality judgments.
  • Detailed Justification: Requires explicit, step-by-step reasoning for all judgments, from gatekeeper identification to the final winner decision.
  • Specific Output Format: Adheres to a strict, predefined output format for consistent and parseable evaluation results (a parsing sketch follows this list).
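
Because the output format is fixed, verdicts can be extracted mechanically. The card does not spell out the exact markers, so the "Phase N" headers and "Winner: Response A/B" line below are assumptions to be adapted to whatever the model actually emits; the sketch only illustrates the parsing pattern.

```python
# Sketch: parsing a phase-structured judgment. The "Phase 1/2/3" headers and
# the "Winner: Response A/B" line are assumed markers, not a documented format.
import re

def parse_judgment(text: str) -> dict:
    """Split the judge output into its three phases and extract the winner."""
    phases = {}
    sections = re.split(r"Phase\s+([123])[:.]", text)
    # re.split with a capturing group yields [preamble, "1", body1, "2", body2, ...]
    for i in range(1, len(sections) - 1, 2):
        phases[f"phase_{sections[i]}"] = sections[i + 1].strip()

    winner_match = re.search(r"Winner:\s*Response\s*([AB])", text, re.IGNORECASE)
    phases["winner"] = winner_match.group(1).upper() if winner_match else None
    return phases

example = (
    "Phase 1: Compliance Check\nGatekeeper Criterion: 50-word limit. Both comply.\n"
    "Phase 2: Response Analysis\nResponse A covers the main claim; Response B omits it.\n"
    "Phase 3: Final Judgment\nWinner: Response A. It satisfies all criteria."
)
print(parse_judgment(example)["winner"])  # -> "A"
```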

Good for

  • Automated Response Grading: Ideal for systems requiring automated, rubric-based evaluation of LLM outputs.
  • Comparative Analysis: Useful for comparing the quality and compliance of two different AI responses (Response A vs. Response B).
  • Quality Assurance: Can be integrated into pipelines to ensure AI-generated content meets specific guidelines and criteria (see the gating sketch after this list).
  • Research in LLM Evaluation: Provides a structured approach to judging, which can be valuable for research into LLM performance and alignment.
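
As a rough illustration of the quality-assurance use case, a pipeline could gate each new candidate on whether the judge prefers it over a trusted reference answer. The helper below is a hypothetical sketch; judge_fn and parse_fn stand in for the inference and parsing sketches above.

```python
# Hypothetical QA gate: accept a candidate only if the judge declares it the
# winner (as Response A) against a trusted reference answer (as Response B).
from typing import Callable, Dict

def passes_quality_gate(
    instruction: str,
    rubric: str,
    candidate: str,
    reference: str,
    judge_fn: Callable[..., str],     # runs the judge model, returns raw text
    parse_fn: Callable[[str], Dict],  # e.g. parse_judgment() from the earlier sketch
) -> bool:
    raw = judge_fn(instruction, rubric, response_a=candidate, response_b=reference)
    return parse_fn(raw).get("winner") == "A"
```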