ai-forever/pollux-judge-32b

Hugging Face
Text Generation · Concurrency Cost: 2 · Model Size: 32.8B · Quant: FP8 · Ctx Length: 32k · Published: Jun 5, 2025 · License: MIT · Architecture: Transformer · Open Weights

ai-forever/pollux-judge-32b is a 32.8 billion parameter decoder-based generative language model developed by ai-forever, specifically designed for evaluating the quality of other language models' responses in Russian. Built upon the t-tech/T-pro-it-1.0 architecture, it predicts numerical scores and detailed textual rationales for LLM outputs based on instructions, criteria, and rubrics. This model excels at automated LLM performance evaluation for Russian-language tasks, processing one criterion per evaluation run.


Overview

pollux-judge-32b is a 32.8 billion parameter decoder-based generative language model from ai-forever, specifically engineered to evaluate the quality of other language models' responses in Russian. It is a core component of the POLLUX project, which focuses on comprehensive LLM evaluation using systematic taxonomies for tasks and criteria. The model predicts both numerical scores and detailed textual rationales for LLM outputs, considering the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and optional reference answers.

Key Capabilities

  • Automated LLM Evaluation: Designed to assess the quality of Russian-language LLM responses.
  • Score and Rationale Generation: Provides both a numerical score and a textual explanation for its judgment.
  • Criterion-Specific Assessment: Optimized to evaluate responses against a single, predefined criterion per run, requiring explicit specification of criteria and rubrics.
  • Russian Language Focus: Specifically trained and optimized for tasks and evaluation criteria derived from the POLLUX dataset, which is focused on Russian.
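The single-criterion evaluation flow above can be sketched as a small helper that assembles the judge's input from an instruction, a response, one criterion with its rubrics, and an optional reference answer, then parses a score out of the generated text. The field layout and the `Score:` output format here are illustrative assumptions, not the model's documented prompt template:

```python
import re

def build_judge_prompt(instruction, response, criterion, rubrics, reference=None):
    """Assemble a single-criterion evaluation prompt.

    NOTE: this layout is a hypothetical illustration; consult the
    model card for the exact template pollux-judge-32b expects.
    """
    parts = [
        f"Instruction:\n{instruction}",
        f"Response:\n{response}",
        f"Criterion:\n{criterion}",
        f"Rubrics:\n{rubrics}",
    ]
    if reference:
        parts.append(f"Reference answer:\n{reference}")
    parts.append("Give a score and a rationale.")
    return "\n\n".join(parts)

def parse_score(generated_text):
    """Extract the first integer score, assuming a 'Score: N' line in the output."""
    match = re.search(r"Score:\s*(\d+)", generated_text)
    return int(match.group(1)) if match else None

# Parse a hypothetical judge completion
print(parse_score("Score: 4\nRationale: Fluent, but omits one required fact."))  # 4
```

Because the model handles one criterion per run, evaluating a response against several criteria means looping this call once per criterion.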

Training and Performance

The model was fine-tuned from t-tech/T-pro-it-1.0 on a synthetically generated dataset of 1,000,000 samples, created to align with the POLLUX dataset's taxonomies. Evaluated against human expert judgments on the POLLUX dataset, it achieves Spearman's rank correlation and Mean Absolute Error (MAE) competitive with models like DeepSeek-R1 and GPT-4o, particularly in out-of-domain scenarios. For instance, its average Spearman's rank correlation is 0.627 and its average MAE is 0.483 across various evaluated LLMs.
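Both reported metrics compare paired judge and expert scores. A minimal, dependency-free illustration of how they are computed (with made-up scores, not POLLUX data; the rank helper assumes no ties for simplicity):

```python
def ranks(values):
    # 1-based rank positions; assumes no tied values for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(a, b):
    # Spearman's rho via the difference-of-ranks formula (valid without ties).
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def mae(a, b):
    # Mean Absolute Error between judge scores and expert scores.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

judge  = [4, 2, 5, 3, 1]   # hypothetical judge scores
expert = [5, 2, 4, 3, 1]   # hypothetical expert scores
print(spearman(judge, expert))  # 0.9
print(mae(judge, expert))       # 0.4
```

Higher Spearman correlation means the judge orders responses the way experts do; lower MAE means its absolute scores sit closer to the expert scores.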

When to Use This Model

This model is ideal for developers and researchers who need to programmatically evaluate the quality of Russian-language LLM outputs. It is best used when you have a clear instruction, an LLM response, and a specific evaluation criterion with corresponding rubrics. It is not intended for autonomous criterion determination or simultaneous multi-criterion evaluation.