PKU-ONELab/Themis
Text generation · Model size: 8B · Quantization: FP8 · Context length: 8K · Published: Jun 27, 2024 · License: apache-2.0 · Architecture: Transformer

Themis is an 8-billion parameter large language model developed by PKU-ONELab, designed specifically for reference-free Natural Language Generation (NLG) evaluation. It combines versatility across diverse NLG tasks, independence from reference texts, flexibility for customized evaluation criteria, and interpretability through accompanying analysis and explanations. Themis evaluates tasks such as summarization, dialogue response generation, and machine translation, and reports superior performance compared to other evaluation models, including GPT-4, across multiple benchmarks.


Themis: A Comprehensive NLG Evaluation Model

Themis is an 8-billion parameter large language model (LLM) from PKU-ONELab, specifically engineered for Natural Language Generation (NLG) evaluation. Unlike many traditional methods, Themis operates in a reference-free manner, meaning it does not require human-written reference texts for comparison. This model is distinguished by four core characteristics:

Key Capabilities

  • Versatility: Evaluates a wide range of NLG tasks, including less common ones like question-answering evaluation.
  • Independence: Performs evaluations without relying on reference texts.
  • Flexibility: Allows users to define specific and customized evaluation aspects and criteria, from overall quality to fine-grained details.
  • Interpretability: Provides not only a rating but also corresponding analysis and explanations for its evaluations.
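To make the flexibility point concrete, a customized evaluation request for a Themis-style evaluator might be assembled as in the sketch below. The field names and template are illustrative assumptions for demonstration only, not the model's official input format.

```python
# Illustrative sketch of a reference-free evaluation prompt.
# The template and field names here are assumptions, not the
# documented Themis prompt format.
def build_eval_prompt(task: str, aspect: str, criterion: str,
                      source: str, target: str) -> str:
    """Assemble an instruction asking the evaluator to rate `target`
    on `aspect` given only the `source` input (no reference text)."""
    return (
        f"Task: {task}\n"
        f"Evaluation aspect: {aspect}\n"
        f"Criterion: {criterion}\n"
        f"Source: {source}\n"
        f"Generated text: {target}\n"
        "Provide an analysis and a rating from 1 to 5."
    )

prompt = build_eval_prompt(
    task="Summarization",
    aspect="Consistency",
    criterion="The summary should contain only facts supported by the source.",
    source="The city council approved the new budget on Tuesday.",
    target="The council passed the budget this week.",
)
print(prompt)
```

Because the model is reference-free, note that no human-written reference summary appears anywhere in the prompt; only the source document and the generated text are required.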

Performance Highlights

Experimental results show that Themis achieves superior overall evaluation performance across various NLG tasks and datasets, including SummEval for summarization, Topical-Chat for dialogue, and WMT23 for machine translation. It demonstrates a higher average Spearman correlation with human judgments than other evaluation models, including GPT-4, on several benchmarks: Themis-8B achieved an average Spearman of 0.542, outperforming GPT-4 Turbo's 0.521.
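The benchmark numbers above are meta-evaluation scores: rank correlations between the evaluator's ratings and human judgments. A minimal self-contained sketch of how such a Spearman correlation is computed (using made-up illustrative scores, not data from the paper):

```python
# Spearman correlation between an evaluator's ratings and human
# judgments. Scores below are invented for illustration.
def ranks(xs):
    """Return 1-based ranks of xs (assumes no tied values)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman rho via the classic sum-of-squared-rank-differences
    formula, exact when there are no ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human_scores = [5, 3, 4, 2, 1]        # human quality judgments
model_scores = [4.5, 3.0, 4.0, 1.0, 2.0]  # evaluator's ratings
print(spearman(human_scores, model_scores))  # → 0.9
```

A higher rho means the evaluator ranks outputs more like human annotators do, which is what the SummEval, Topical-Chat, and WMT23 comparisons measure.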

Good for

  • Developers and researchers needing a robust, reference-free NLG evaluation system.
  • Tasks requiring flexible and customizable evaluation criteria.
  • Scenarios where detailed explanations and analysis of evaluation scores are beneficial.