TIGERScore-13B: An Explainable, Reference-Free Text Generation Metric
TIGERScore-13B, developed by TIGER-Lab, is a 13-billion-parameter model based on LLaMA-2, designed as a trained metric for evaluating text generation. Unlike traditional metrics that often rely on reference texts or are limited to specific domains, TIGERScore operates reference-free and provides explainable error analysis.
Key Capabilities
- Instruction-Guided Evaluation: Evaluates text generation based on natural language instructions.
- Detailed Error Analysis: Pinpoints errors in generated text by identifying each error's location and aspect, explaining it, and assigning a penalty score.
- Reference-Free: Assesses quality without needing a ground-truth reference output.
- Broad Task Coverage: Trained on the MetricInstruct dataset, covering 6 text generation tasks (Summarization, Translation, Data2Text, Long-form QA, MathQA, and Instruction Following) and 23 datasets.
- High Correlation with Human Ratings: Demonstrates superior correlation with human judgments compared to many existing metrics, both reference-based and reference-free, across various tasks.
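Because the metric is instruction-guided and reference-free, scoring a candidate requires only three pieces of text: the task instruction, the source input, and the hypothesis output. A minimal sketch of assembling such an evaluation prompt follows; the template wording and the `build_eval_prompt` helper are illustrative assumptions, not the model's official prompt format.

```python
import textwrap

def build_eval_prompt(instruction: str, source: str, hypothesis: str) -> str:
    """Assemble an instruction-guided evaluation prompt (assumed template)."""
    # The structure below mirrors what TIGERScore evaluates (instruction,
    # input context, model output), but the exact wording is hypothetical.
    return textwrap.dedent(f"""\
        You are evaluating the following model output.
        Instruction: {instruction}
        Source: {source}
        Model output: {hypothesis}
        List each error with its location, aspect, and explanation, assign a
        score reduction per error, then give the total score reduction.""")

prompt = build_eval_prompt(
    instruction="Summarize the article in one sentence.",
    source="The city council met on Tuesday to approve the new budget...",
    hypothesis="The council rejected the budget on Friday.",
)
# The assembled prompt would then be passed to TIGER-Lab/TIGERScore-13B,
# e.g. through a standard Hugging Face text-generation pipeline.
```

Note that no reference summary appears anywhere in the prompt, which is what allows the metric to run in settings where gold outputs are unavailable.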
Good For
- Automated Evaluation of LLM Outputs: Ideal for developers and researchers needing to automatically assess the quality of text generated by large language models.
- Debugging and Improving Text Generation: The detailed error explanations can help in understanding specific weaknesses in generation models and guide improvements.
- Research in Text Evaluation: Offers a powerful, interpretable, and easy-to-use tool for advancing research in universal explainable metrics for text generation.
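For debugging workflows like those above, the explainable output is most useful once it is machine-readable. The sketch below parses a structured error analysis into records and a total penalty; the raw text format in `SAMPLE_ANALYSIS` is an invented illustration of the location/aspect/explanation/score-reduction structure described earlier, not the model's verbatim output.

```python
import re

# Invented sample of an error analysis with the fields TIGERScore reports:
# location, aspect, explanation, and a per-error score reduction.
SAMPLE_ANALYSIS = (
    'Error 1: Location: "on Friday"; Aspect: Factual consistency; '
    "Explanation: The source says the meeting was on Tuesday; Score reduction: 2\n"
    'Error 2: Location: "rejected"; Aspect: Accuracy; '
    "Explanation: The council approved, not rejected, the budget; Score reduction: 3\n"
)

def parse_errors(analysis: str) -> list[dict]:
    """Extract structured error records from the (assumed) analysis format."""
    pattern = re.compile(
        r'Location: "(?P<location>[^"]+)"; Aspect: (?P<aspect>[^;]+); '
        r"Explanation: (?P<explanation>[^;]+); Score reduction: (?P<penalty>\d+)"
    )
    return [
        {**m.groupdict(), "penalty": int(m.group("penalty"))}
        for m in pattern.finditer(analysis)
    ]

errors = parse_errors(SAMPLE_ANALYSIS)
total_penalty = sum(e["penalty"] for e in errors)  # 5 for the sample above
```

Summing the per-error penalties yields a single quality score, while the individual records can be aggregated by aspect to reveal a generation model's recurring weaknesses.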