Overview
pollux-judge-32b is a 32.8-billion-parameter decoder-based generative language model from ai-forever, engineered to evaluate the quality of other language models' responses in Russian. It is a core component of the POLLUX project, which builds comprehensive LLM evaluation on systematic taxonomies of tasks and criteria. Given the original instruction, the LLM's response, a specific evaluation criterion, its scoring rubrics, and an optional reference answer, the model predicts both a numerical score and a detailed textual rationale.
Key Capabilities
- Automated LLM Evaluation: Designed to assess the quality of Russian-language LLM responses.
- Score and Rationale Generation: Provides both a numerical score and a textual explanation for its judgment.
- Criterion-Specific Assessment: Optimized to evaluate responses against a single, predefined criterion per run, requiring explicit specification of criteria and rubrics.
- Russian Language Focus: Specifically trained and optimized for tasks and evaluation criteria derived from the POLLUX dataset, which is focused on Russian.
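The single-criterion input format described above can be sketched as simple prompt assembly. The field labels and template below are illustrative assumptions, not the model's actual prompt format; consult the model card for the real template.

```python
# Sketch: assembling a single-criterion evaluation prompt for an LLM judge.
# The Russian field labels and layout are assumptions for illustration only.

def build_judge_prompt(instruction, response, criterion, rubrics, reference=None):
    """Combine all evaluation inputs into one prompt string."""
    parts = [
        f"Инструкция:\n{instruction}",
        f"Ответ модели:\n{response}",
        f"Критерий оценки:\n{criterion}",          # exactly one criterion per run
        f"Шкала оценивания:\n{rubrics}",
    ]
    if reference is not None:                       # reference answer is optional
        parts.append(f"Эталонный ответ:\n{reference}")
    parts.append("Выставьте оценку и обоснуйте её.")
    return "\n\n".join(parts)

prompt = build_judge_prompt(
    instruction="Переведите на русский: 'The cat sleeps.'",
    response="Кошка спит.",
    criterion="Точность перевода",
    rubrics="0 - неверно, 1 - частично верно, 2 - полностью верно",
)
print(prompt)
```

Because the model evaluates one criterion per run, scoring a response against several criteria means building and submitting one such prompt per criterion.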
Training and Performance
The model was fine-tuned from t-tech/T-pro-it-1.0 on a synthetically generated dataset of 1,000,000 samples built to align with the POLLUX dataset's taxonomies. Evaluated against human expert judgments on the POLLUX dataset, it achieves Spearman's rank correlation and Mean Absolute Error (MAE) competitive with models such as DeepSeek-R1 and GPT-4o, particularly in out-of-domain scenarios: averaged across the evaluated LLMs, its Spearman's rank correlation is 0.627 and its MAE is 0.483.
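The two agreement metrics reported above can be computed directly from paired judge and human scores. A minimal pure-Python sketch (the score lists are made-up illustrations, not POLLUX data):

```python
# Sketch: Spearman's rank correlation and MAE between judge scores and
# human expert scores. The example score lists are invented for illustration.

def ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                                  # extend over the tie group
        avg = (i + j) / 2 + 1                       # mean of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def mae(a, b):
    """Mean absolute error between paired scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

human = [2, 1, 0, 2, 1]
judge = [2, 1, 1, 2, 0]
print(round(spearman(human, judge), 3), round(mae(human, judge), 3))  # → 0.75 0.4
```

In practice `scipy.stats.spearmanr` and `sklearn.metrics.mean_absolute_error` compute the same quantities; the hand-rolled versions are shown only to make the metrics concrete.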
When to Use This Model
This model suits developers and researchers who need to programmatically evaluate the quality of Russian-language LLM outputs. It works best when you supply a clear instruction, the LLM response under evaluation, and a single evaluation criterion with its rubrics. It is not intended to determine evaluation criteria autonomously or to score multiple criteria in a single pass.
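Since the judge returns both a score and a rationale in one completion, downstream code needs to split them apart. A hedged sketch, assuming a simple "Оценка: <n>" output convention that is illustrative rather than the model's documented format:

```python
import re

# Sketch: parsing a judge completion into (score, rationale).
# The "Оценка: <n>" marker assumed here is an illustration; check the
# model card for the actual structure of the generated judgment.

def parse_judgment(text):
    """Extract the first integer score; the remaining text is the rationale."""
    match = re.search(r"Оценка:\s*(\d+)", text)
    if match is None:
        return None, text.strip()                   # no score found
    score = int(match.group(1))
    rationale = (text[:match.start()] + text[match.end():]).strip()
    return score, rationale

score, rationale = parse_judgment("Оценка: 2\nПеревод точен и полон.")
print(score)  # → 2
```

Returning `None` when no score is found lets callers detect malformed judgments instead of silently miscounting them.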