PKU-ONELab/Themis
Text generation · Model size: 8B · Quantization: FP8 · Context length: 8K · Published: Jun 27, 2024 · License: apache-2.0 · Architecture: Transformer

Themis is an 8-billion parameter large language model developed by PKU-ONELab, designed specifically for reference-free Natural Language Generation (NLG) evaluation. It combines versatility across diverse NLG tasks, independence from reference texts, flexibility for customized evaluation criteria, and interpretability through accompanying analysis and explanations. Themis evaluates tasks such as summarization, dialogue response generation, and machine translation, and reports superior performance compared to other evaluation models, including GPT-4, across multiple benchmarks.


Themis: A Comprehensive NLG Evaluation Model

Themis is an 8-billion parameter large language model (LLM) from PKU-ONELab, specifically engineered for Natural Language Generation (NLG) evaluation. Unlike many traditional methods, Themis operates in a reference-free manner, meaning it does not require human-written reference texts for comparison. This model is distinguished by four core characteristics:

Key Capabilities

  • Versatility: Evaluates a wide range of NLG tasks, including less common ones like question-answering evaluation.
  • Independence: Performs evaluations without relying on reference texts.
  • Flexibility: Allows users to define specific and customized evaluation aspects and criteria, from overall quality to fine-grained details.
  • Interpretability: Provides not only a rating but also corresponding analysis and explanations for its evaluations.
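To make the flexibility point concrete, a customized evaluation request for a Themis-style evaluator might be assembled as in the sketch below. The field names and template are illustrative assumptions for demonstration only, not the model's official input format.

```python
# Illustrative sketch of a reference-free evaluation prompt.
# The template and field names here are assumptions, not the
# documented Themis prompt format.
def build_eval_prompt(task: str, aspect: str, criterion: str,
                      source: str, target: str) -> str:
    """Assemble an instruction asking the evaluator to rate `target`
    on `aspect` given only the `source` input (no reference text)."""
    return (
        f"Task: {task}\n"
        f"Evaluation aspect: {aspect}\n"
        f"Criterion: {criterion}\n"
        f"Source: {source}\n"
        f"Generated text: {target}\n"
        "Provide an analysis and a rating from 1 to 5."
    )

prompt = build_eval_prompt(
    task="Summarization",
    aspect="Consistency",
    criterion="The summary should contain only facts supported by the source.",
    source="The city council approved the new budget on Tuesday.",
    target="The council passed the budget this week.",
)
print(prompt)
```

Because the model is reference-free, note that no human-written reference summary appears anywhere in the prompt; only the source document and the generated text are required.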

Performance Highlights

Experimental results show that Themis achieves superior overall evaluation performance across various NLG tasks and datasets, including SummEval for summarization, Topical-Chat for dialogue, and WMT23 for machine translation. It demonstrates a higher average Spearman correlation with human judgments than other evaluation models, including GPT-4, on several benchmarks: Themis-8B achieved an average Spearman of 0.542, outperforming GPT-4 Turbo's 0.521.
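The benchmark numbers above are meta-evaluation scores: rank correlations between the evaluator's ratings and human judgments. A minimal self-contained sketch of how such a Spearman correlation is computed (using made-up illustrative scores, not data from the paper):

```python
# Spearman correlation between an evaluator's ratings and human
# judgments. Scores below are invented for illustration.
def ranks(xs):
    """Return 1-based ranks of xs (assumes no tied values)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman rho via the classic sum-of-squared-rank-differences
    formula, exact when there are no ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human_scores = [5, 3, 4, 2, 1]        # human quality judgments
model_scores = [4.5, 3.0, 4.0, 1.0, 2.0]  # evaluator's ratings
print(spearman(human_scores, model_scores))  # → 0.9
```

A higher rho means the evaluator ranks outputs more like human annotators do, which is what the SummEval, Topical-Chat, and WMT23 comparisons measure.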

Good for

  • Developers and researchers needing a robust, reference-free NLG evaluation system.
  • Tasks requiring flexible and customizable evaluation criteria.
  • Scenarios where detailed explanations and analysis of evaluation scores are beneficial.