Model Overview
pollux-judge-7b is a 7.6-billion-parameter generative language model from ai-forever, built on the t-tech/T-lite-it-1.0 architecture and designed specifically for evaluating the output quality of other Large Language Models (LLMs) in Russian. The model operates generatively: given an instruction, an evaluation criterion, a scoring rubric, and a reference answer, it produces both a numerical score and a detailed textual rationale for the LLM response under evaluation.
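To make the input contract concrete, the sketch below assembles the five evaluation inputs into a single prompt string. This is an illustrative helper only: the exact template, field order, and wording expected by pollux-judge-7b are assumptions here, so consult the official model card for the real format.

```python
# Hypothetical prompt assembly for a criterion-based judge model.
# The field labels below are illustrative assumptions, not the
# verified pollux-judge-7b template.

def build_judge_prompt(instruction: str, response: str, criterion: str,
                       rubric: str, reference: str) -> str:
    """Combine instruction, response, criterion, rubric, and
    reference answer into one evaluation prompt."""
    return (
        f"Инструкция:\n{instruction}\n\n"        # task given to the evaluated LLM
        f"Ответ модели:\n{response}\n\n"         # response being judged
        f"Критерий оценки:\n{criterion}\n\n"     # single criterion per run
        f"Шкала оценивания:\n{rubric}\n\n"       # scoring rubric
        f"Эталонный ответ:\n{reference}\n"       # reference answer
    )

prompt = build_judge_prompt(
    instruction="Переведите предложение на русский язык.",
    response="Кошка сидит на коврике.",
    criterion="Точность перевода",
    rubric="0 — неверно, 1 — частично верно, 2 — полностью верно",
    reference="Кот сидит на коврике.",
)
```

The resulting string would then be passed to the model through its chat template for generation.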
Key Capabilities
- Automated LLM Evaluation: Provides quantitative and qualitative assessments of LLM responses in Russian.
- Criterion-Specific Scoring: Assesses text responses against a single, predefined criterion per evaluation run, generating scores and rationales.
- Russian Language Focus: Optimized and trained using generative tasks and evaluation criteria derived from the POLLUX dataset, focusing on Russian-language content.
- Synthetic Data Training: Trained on 1,000,000 synthetic samples generated by state-of-the-art LLMs (DeepSeek-V3, GPT-4o, o3-mini) and a diverse set of open-source models.
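Because each evaluation run yields a score plus a free-text rationale, downstream code usually parses both out of the generated text. A minimal sketch, assuming a simple `Score: N` / `Rationale: …` layout (the actual output format of pollux-judge-7b may differ):

```python
import re

def parse_judgment(text: str):
    """Extract a numeric score and a rationale from judge output.

    Assumes output shaped like 'Score: 2\nRationale: ...'; this
    layout is an assumption, not the documented model format.
    """
    score_match = re.search(r"Score:\s*(\d+)", text)
    rationale_match = re.search(r"Rationale:\s*(.+)", text, re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    rationale = rationale_match.group(1).strip() if rationale_match else ""
    return score, rationale

score, rationale = parse_judgment(
    "Score: 2\nRationale: The translation is accurate and fluent."
)
```

Returning `None` when no score is found lets callers detect malformed generations instead of silently miscounting them.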
Performance
Evaluated on the POLLUX dataset using Spearman's rank correlation with expert judgments and mean absolute error (MAE), pollux-judge-7b demonstrates performance competitive with reference judges such as DeepSeek-R1 and GPT-4o on both metrics for Russian LLM evaluation.
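To illustrate the two metrics, here is a self-contained sketch computing Spearman's rank correlation (with average ranks for ties) and MAE between a hypothetical judge's scores and expert scores. The numbers are made up for illustration; they are not results from the POLLUX evaluation.

```python
def ranks(xs):
    """Assign average ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1           # average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def mae(x, y):
    """Mean absolute error between two score vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

judge_scores = [2, 1, 0, 2, 1]    # illustrative judge outputs
expert_scores = [2, 1, 1, 2, 0]   # illustrative expert labels
rho = spearman(judge_scores, expert_scores)   # → 0.75
err = mae(judge_scores, expert_scores)        # → 0.4
```

Higher rho and lower MAE mean the judge tracks expert rankings and absolute scores more closely.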
Good For
- Developers and researchers needing automated, objective evaluation of Russian LLM outputs.
- Assessing LLM response quality against specific criteria and rubrics.
- Integrating into larger systems for continuous quality monitoring of Russian-language generative models.