Model Overview
pollux-judge-7b is a 7.6-billion-parameter generative language model from ai-forever, built on the t-tech/T-lite-it-1.0 architecture and designed specifically for evaluating the output quality of other Large Language Models (LLMs) in Russian. The model operates generatively: given an instruction, an evaluation criterion, a scoring rubric, and a reference answer, it produces both a numerical score and a detailed textual rationale for the LLM response under evaluation.
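To make the input contract concrete, the sketch below assembles the five evaluation inputs into a single prompt string. This is an illustrative helper only: the exact template, field order, and wording expected by pollux-judge-7b are assumptions here, so consult the official model card for the real format.

```python
# Hypothetical prompt assembly for a criterion-based judge model.
# The field labels below are illustrative assumptions, not the
# verified pollux-judge-7b template.

def build_judge_prompt(instruction: str, response: str, criterion: str,
                       rubric: str, reference: str) -> str:
    """Combine instruction, response, criterion, rubric, and
    reference answer into one evaluation prompt."""
    return (
        f"Инструкция:\n{instruction}\n\n"        # task given to the evaluated LLM
        f"Ответ модели:\n{response}\n\n"         # response being judged
        f"Критерий оценки:\n{criterion}\n\n"     # single criterion per run
        f"Шкала оценивания:\n{rubric}\n\n"       # scoring rubric
        f"Эталонный ответ:\n{reference}\n"       # reference answer
    )

prompt = build_judge_prompt(
    instruction="Переведите предложение на русский язык.",
    response="Кошка сидит на коврике.",
    criterion="Точность перевода",
    rubric="0 — неверно, 1 — частично верно, 2 — полностью верно",
    reference="Кот сидит на коврике.",
)
```

The resulting string would then be passed to the model through its chat template for generation.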
Key Capabilities
- Automated LLM Evaluation: Provides quantitative and qualitative assessments of LLM responses in Russian.
- Criterion-Specific Scoring: Assesses text responses against a single, predefined criterion per evaluation run, generating scores and rationales.
- Russian Language Focus: Optimized and trained using generative tasks and evaluation criteria derived from the POLLUX dataset, focusing on Russian-language content.
- Synthetic Data Training: Trained on 1,000,000 synthetic samples generated by state-of-the-art LLMs (DeepSeek-V3, GPT-4o, o3-mini) and a diverse set of open-source models.
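Because each evaluation run yields a score plus a free-text rationale, downstream code usually parses both out of the generated text. A minimal sketch, assuming a simple `Score: N` / `Rationale: …` layout (the actual output format of pollux-judge-7b may differ):

```python
import re

def parse_judgment(text: str):
    """Extract a numeric score and a rationale from judge output.

    Assumes output shaped like 'Score: 2\nRationale: ...'; this
    layout is an assumption, not the documented model format.
    """
    score_match = re.search(r"Score:\s*(\d+)", text)
    rationale_match = re.search(r"Rationale:\s*(.+)", text, re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    rationale = rationale_match.group(1).strip() if rationale_match else ""
    return score, rationale

score, rationale = parse_judgment(
    "Score: 2\nRationale: The translation is accurate and fluent."
)
```

Returning `None` when no score is found lets callers detect malformed generations instead of silently miscounting them.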
Performance
Evaluated on the POLLUX dataset using Spearman's rank correlation with expert judgments and mean absolute error (MAE), pollux-judge-7b demonstrates performance competitive with reference judges such as DeepSeek-R1 and GPT-4o on both metrics for Russian LLM evaluation.
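To illustrate the two metrics, here is a self-contained sketch computing Spearman's rank correlation (with average ranks for ties) and MAE between a hypothetical judge's scores and expert scores. The numbers are made up for illustration; they are not results from the POLLUX evaluation.

```python
def ranks(xs):
    """Assign average ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1           # average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def mae(x, y):
    """Mean absolute error between two score vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

judge_scores = [2, 1, 0, 2, 1]    # illustrative judge outputs
expert_scores = [2, 1, 1, 2, 0]   # illustrative expert labels
rho = spearman(judge_scores, expert_scores)   # → 0.75
err = mae(judge_scores, expert_scores)        # → 0.4
```

Higher rho and lower MAE mean the judge tracks expert rankings and absolute scores more closely.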
Good For
- Developers and researchers needing automated, objective evaluation of Russian LLM outputs.
- Assessing LLM response quality against specific criteria and rubrics.
- Integrating into larger systems for continuous quality monitoring of Russian-language generative models.