OpenRubrics/RubricARROW-8B-Judge
OpenRubrics/RubricARROW-8B-Judge is an 8-billion-parameter language model fine-tuned from Qwen/Qwen3-8B. Developed by OpenRubrics, it is designed to evaluate LLM responses against predefined rubric items in non-verifiable domains. It functions as a judge model: for each rubric item it returns an explanation and a boolean assessment of whether the criterion is met, which suits it to automated quality assurance and post-training evaluation of other LLMs.
OpenRubrics/RubricARROW-8B-Judge Overview
This model is an 8-billion-parameter judge model fine-tuned from Qwen/Qwen3-8B. It was developed by OpenRubrics as part of the RUBRIC-ARROW framework, which focuses on alternating pointwise rubric reward modeling for LLM post-training, particularly in domains where responses are hard to verify automatically.
Key Capabilities
- Automated LLM Response Evaluation: Designed to assess the quality of LLM-generated responses against specific rubric items.
- Detailed Explanations: Provides a string explanation for why a response does or does not meet a given criterion.
- Boolean Criteria Assessment: Outputs a true/false boolean indicating whether each rubric item's criteria are fully met.
- JSON Output Format: Structures its evaluation output as a single JSON object, making it easy for programmatic parsing.
- Rubric-Driven Scoring: Integrates with a probability-based scoring mechanism that can assign weights to different rubric tags (e.g., "hard rule," "principle").
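The capabilities above could be consumed along the following lines. This is a minimal sketch, not the model's documented interface: the JSON key names `explanation` and `criteria_met`, the tag weights in `TAG_WEIGHTS`, and the weight-normalized average in `weighted_score` are all assumptions made for illustration. The source states only that the output is a single JSON object with a string explanation and a boolean, and that scoring is probability-based with per-tag weights.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    explanation: str
    criteria_met: bool


def parse_verdict(raw: str) -> Verdict:
    """Parse one judge output, assumed to contain a single JSON object."""
    # Tolerate stray text around the object by slicing from the first "{"
    # to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw[start:end + 1])
    met = data["criteria_met"]  # hypothetical key name
    if not isinstance(met, bool):
        raise ValueError("criteria_met must be a JSON boolean")
    return Verdict(explanation=str(data["explanation"]), criteria_met=met)


# Hypothetical weights per rubric tag; the real values are not given in the source.
TAG_WEIGHTS = {"hard rule": 2.0, "principle": 1.0}


def weighted_score(items: list[tuple[str, float]]) -> float:
    """Weight-normalized mean of per-item probabilities.

    Each item is (tag, p_met), where p_met is the probability in [0, 1]
    that the rubric item is met. This formula is an assumption.
    """
    total = sum(TAG_WEIGHTS[tag] for tag, _ in items)
    if total == 0:
        return 0.0
    return sum(TAG_WEIGHTS[tag] * p for tag, p in items) / total
```

For example, `parse_verdict('{"explanation": "Cites a source.", "criteria_met": true}')` yields a `Verdict` whose `criteria_met` is `True`, and under these assumed weights a response that satisfies a "hard rule" item but fails a "principle" item scores two thirds.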
Good For
- LLM Post-training and Fine-tuning: Ideal for generating reward signals to improve LLM performance in non-verifiable domains.
- Automated Quality Assurance: Can be used to automatically check if LLM outputs adhere to specific guidelines or requirements.
- Developer Tooling: Provides a structured way to evaluate and debug LLM responses based on explicit criteria.
This model is particularly useful for scenarios where human evaluation is costly or time-consuming, offering a scalable solution for assessing LLM output quality based on predefined rubrics.