Name: ai-forever/Pollux-4B-Judge API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: ai-forever

Overview

Pollux-4B-Judge is a 4-billion-parameter generative language model from ai-forever, built to evaluate the quality of Russian-language LLM responses. It is based on Qwen/Qwen3-4B and is a decoder-based model trained in a sequence-to-sequence fashion.

Key Capabilities

Automated LLM Evaluation: Designed to assess the quality of other language models' outputs in Russian.
Score and Rationale Generation: Predicts both numerical scores and detailed textual rationales for responses.
Criterion-Based Assessment: Evaluates an answer given four inputs: the original instruction, the LLM's response, a specific evaluation criterion, and the scoring rubrics for that criterion.
Optimized for Russian: Training uses generative tasks and evaluation criteria from the POLLUX dataset.
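
The four inputs listed under Criterion-Based Assessment can be sketched as a small data structure plus a prompt builder. This is only an illustration: the JudgeInput class, the build_prompt function, and the section labels in the template are hypothetical and are not the prompt format used to train Pollux-4B-Judge, which should be taken from the full model card.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JudgeInput:
    """The four inputs the judge expects for one evaluation run (hypothetical container)."""
    instruction: str          # the task originally given to the evaluated LLM
    answer: str               # the LLM response being judged
    criterion: str            # exactly one evaluation criterion
    rubrics: Dict[int, str]   # score -> description of what earns that score


def build_prompt(item: JudgeInput) -> str:
    """Assemble a single text prompt from the four inputs.

    The layout below is an assumed placeholder, not the official template.
    """
    rubric_lines = "\n".join(
        f"{score}: {description}"
        for score, description in sorted(item.rubrics.items())
    )
    return (
        f"Instruction:\n{item.instruction}\n\n"
        f"Answer:\n{item.answer}\n\n"
        f"Criterion:\n{item.criterion}\n\n"
        f"Rubrics:\n{rubric_lines}\n"
    )


example = JudgeInput(
    instruction="Напишите хокку о зиме.",
    answer="Тихо падает снег...",
    criterion="Соответствие формату",
    rubrics={0: "Формат не соблюден", 1: "Формат соблюден частично", 2: "Формат соблюден"},
)
prompt = build_prompt(example)
```

Keeping the inputs in one structure makes it harder to forget the rubrics, which the model needs in order to map its judgment onto a numerical score.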

Performance

In comparisons with other models on evaluating Russian LLM responses, Pollux-4B-Judge achieves an RMSE of 0.568, a macro F1 of 0.705, and a Spearman's rank correlation of 0.744 with expert judgments, outperforming several larger models on these metrics.

Intended Use

This model is best used for assessing a text response against a single, predefined criterion per evaluation run. Each run requires an explicit instruction, the response to be evaluated, the specific criterion, and its corresponding scoring rubrics. The model is not designed to process multiple criteria simultaneously or to determine criteria on its own.
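
When a response has to be checked against several criteria, the one-criterion-per-run constraint suggests looping over the criteria and calling the judge once for each. The sketch below shows that pattern under stated assumptions: evaluate_per_criterion and stub_judge are hypothetical names, and the judge callable stands in for whatever client actually sends the request to the model.

```python
from typing import Callable, Dict

# Rubrics for one criterion: score -> description.
Rubrics = Dict[int, str]
# A judge takes (instruction, answer, criterion, rubrics) and returns a result dict.
Judge = Callable[[str, str, str, Rubrics], dict]


def evaluate_per_criterion(
    instruction: str,
    answer: str,
    criteria: Dict[str, Rubrics],
    judge: Judge,
) -> Dict[str, dict]:
    """Run the judge once per criterion and collect the results by criterion name."""
    if not criteria:
        # The model does not pick criteria itself, so the caller must supply them.
        raise ValueError("at least one criterion with rubrics is required")
    results = {}
    for criterion, rubrics in criteria.items():
        # One evaluation run covers exactly one criterion.
        results[criterion] = judge(instruction, answer, criterion, rubrics)
    return results


def stub_judge(instruction: str, answer: str, criterion: str, rubrics: Rubrics) -> dict:
    """Placeholder judge for demonstration; a real one would query the model."""
    return {"score": min(rubrics), "rationale": f"stub rationale for {criterion}"}


results = evaluate_per_criterion(
    instruction="Напишите хокку о зиме.",
    answer="Тихо падает снег...",
    criteria={
        "Соответствие формату": {0: "Нет", 1: "Частично", 2: "Да"},
        "Грамотность": {0: "Много ошибок", 1: "Без ошибок"},
    },
    judge=stub_judge,
)
```

Passing the judge in as a callable keeps the loop independent of any particular API client, so the same code works with a local model or a hosted endpoint.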
