ai-forever/Pollux-4B-Judge
ai-forever/Pollux-4B-Judge is a 4-billion-parameter generative language model developed by ai-forever and built on the Qwen/Qwen3-4B architecture. This decoder-based model evaluates the quality of other language models' responses in Russian, predicting a numerical score and a textual rationale for each response. It is intended for automated evaluation of LLMs on Russian-language tasks, assessing answers against a specified criterion and scoring rubric.
Pollux-4B-Judge: A Specialized LLM for Russian Response Evaluation
Pollux-4B-Judge is a 4-billion-parameter generative language model from ai-forever, designed to evaluate the quality of other LLMs' responses in Russian. Built on the Qwen/Qwen3-4B architecture, this decoder-based model takes an instruction, an LLM response, an evaluation criterion, and a scoring rubric, and predicts both a numerical score and a detailed textual rationale.
Key Capabilities
- Automated LLM Evaluation: Designed for systematic assessment of generative LLMs, particularly for Russian-language tasks.
- Criterion-Based Scoring: Evaluates responses against a single, predefined criterion per run, providing scores and rationales.
- POLLUX Project Integration: An integral component of the POLLUX project, utilizing its taxonomies for generative tasks and evaluation criteria.
- Synthetic Data Training: Trained on 1,000,000 synthetic samples generated by LLMs such as DeepSeekV3 and GPT-4o, for consistency with the POLLUX framework.
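The four inputs named above (instruction, response, criterion, rubric) can be assembled into a single prompt before being sent to the judge. The following is a minimal sketch only: the `JudgeInput` class, the `build_judge_prompt` helper, and the `###` section layout are hypothetical and are not the model's official prompt template, which should be taken from the model's own documentation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JudgeInput:
    """Hypothetical container for one evaluation request."""
    instruction: str        # the task given to the evaluated model
    answer: str             # the evaluated model's response
    criterion: str          # exactly one criterion per run
    rubric: Dict[int, str]  # score -> what that score means


def build_judge_prompt(item: JudgeInput) -> str:
    """Join the four inputs into one prompt (assumed layout, not official)."""
    if not item.criterion.strip():
        raise ValueError("an explicit criterion is required")
    # List rubric levels in ascending score order.
    rubric_text = "\n".join(
        f"{score}: {meaning}" for score, meaning in sorted(item.rubric.items())
    )
    sections = [
        ("Instruction", item.instruction),
        ("Response", item.answer),
        ("Criterion", item.criterion),
        ("Rubric", rubric_text),
    ]
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections)


example = JudgeInput(
    instruction="Объясни, что такое рекурсия.",
    answer="Рекурсия - это вызов функции из самой себя.",
    criterion="Фактическая корректность",
    rubric={0: "Ответ неверен", 1: "Есть неточности", 2: "Ответ корректен"},
)
prompt = build_judge_prompt(example)
```

Keeping the rubric as a score-to-description mapping makes it easy to reuse the same criterion across many responses while varying only the instruction and answer.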
When to Use This Model
- Evaluating Russian LLM Performance: Ideal for developers and researchers needing to quantitatively and qualitatively assess the output of Russian-speaking LLMs.
- Automating Quality Control: Useful for integrating automated evaluation into LLM development pipelines.
- Research on LLM-as-a-Judge: Serves as a tool for studying and implementing LLM-based evaluation methodologies.
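As an illustration of the quality-control use case, the sketch below averages judge scores over a batch of responses and compares the mean with a threshold. Everything here is an assumption made for the example: the `quality_gate` function, the `(score, rationale)` return shape, and `stub_judge`, which is a placeholder standing in for a real call to the model.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (instruction, response, criterion) -> (score, rationale)
Judge = Callable[[str, str, str], Tuple[float, str]]


def quality_gate(
    samples: Iterable[Tuple[str, str]],
    criterion: str,
    judge: Judge,
    min_mean_score: float,
) -> Tuple[bool, float]:
    """Score every (instruction, response) pair on one criterion and
    report whether the mean score reaches the threshold."""
    scores = [
        judge(instruction, response, criterion)[0]
        for instruction, response in samples
    ]
    if not scores:
        raise ValueError("no samples to evaluate")
    average = mean(scores)
    return average >= min_mean_score, average


def stub_judge(instruction: str, response: str, criterion: str) -> Tuple[float, str]:
    # Placeholder logic only: a real judge would generate the score and rationale.
    score = 2.0 if response.strip() else 0.0
    return score, "stub rationale"


passed, average = quality_gate(
    [("q1", "a1"), ("q2", "")], "Фактическая корректность", stub_judge, 1.0
)
```

Passing the judge in as a callable keeps the gate independent of how the model is served, so the same check can run against a local model or a remote endpoint.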
Limitations
- Single Criterion Focus: Optimized for evaluating one criterion at a time; using multiple criteria simultaneously may yield unpredictable results.
- Requires Explicit Criteria: Not designed to autonomously determine evaluation criteria; explicit specification is necessary for reliable assessments.
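Both limitations suggest the same calling pattern: supply the criteria explicitly and run the judge once per criterion rather than combining several in one request. A minimal sketch of that pattern follows; `score_per_criterion` and `recording_judge` are hypothetical names, and the stub judge only records its calls instead of invoking the model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge signature: (instruction, response, criterion) -> (score, rationale)
Judge = Callable[[str, str, str], Tuple[float, str]]


def score_per_criterion(
    instruction: str,
    response: str,
    criteria: Sequence[str],
    judge: Judge,
) -> Dict[str, Tuple[float, str]]:
    """Call the judge once per criterion; never merge criteria into one run."""
    if not criteria:
        # The judge is not meant to pick criteria on its own.
        raise ValueError("at least one explicit criterion is required")
    results: Dict[str, Tuple[float, str]] = {}
    for criterion in criteria:
        results[criterion] = judge(instruction, response, criterion)
    return results


calls: List[str] = []


def recording_judge(instruction: str, response: str, criterion: str) -> Tuple[float, str]:
    # Stub standing in for the model: remembers which criterion it was asked about.
    calls.append(criterion)
    return 1.0, f"rationale for {criterion}"


results = score_per_criterion(
    "Объясни, что такое рекурсия.",
    "Рекурсия - это вызов функции из самой себя.",
    ["Фактическая корректность", "Полнота ответа"],
    recording_judge,
)
```

The returned mapping keeps each criterion's score and rationale separate, which leaves any aggregation across criteria as an explicit choice for the caller.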