AGI-Eval/UNO-Scorer-Qwen3-14B

Visibility: Public
Parameters: 14B
Quantization: FP8
Context length: 32768
License: apache-2.0
Overview

UNO-Scorer: A Unified General Scoring Model

UNO-Scorer is a lightweight, high-precision LLM-based evaluation model developed by meituan-longcat, built upon the Qwen3-14B backbone. It is specifically designed to automate the evaluation of Large Multimodal Models (LMMs) with minimal computational overhead.

Key Capabilities

  • Automated Evaluation: Takes a question, reference answer, and model response as input, then outputs a numerical score and detailed evaluation reasoning for each sub-question.
  • High Precision: Fine-tuned on 13K high-quality in-house samples, it reaches 0.9505 accuracy on its test set, surpassing proprietary models such as GPT-4.1 on specific evaluation tasks.
  • Multi-Type Question Support: Overcomes limitations of traditional reward models by supporting six question types: Numerical, Enumeration, Multiple-Choice, Yes/No, Short-Answer, and Essay.
  • Multi-Step Open-Ended Questions (MO): Particularly excels in evaluating complex multi-step open-ended questions.
  • Structured Output: Provides detailed step-by-step analysis and a final score within <score>X</score> tags.
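Because the final score is wrapped in <score>X</score> tags, downstream code can recover it with a small parser. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, and it assumes X is a plain integer or decimal number and that the last tag in the output carries the final score.

```python
import re
from typing import List, Optional

# Assumption: X inside <score>X</score> is an integer or decimal number.
SCORE_PATTERN = re.compile(r"<score>\s*(-?\d+(?:\.\d+)?)\s*</score>")


def extract_scores(scorer_output: str) -> List[float]:
    """Return every numeric score found inside <score>...</score> tags, in order."""
    return [float(value) for value in SCORE_PATTERN.findall(scorer_output)]


def final_score(scorer_output: str) -> Optional[float]:
    """Return the last tagged score, or None when the output has no score tag."""
    scores = extract_scores(scorer_output)
    return scores[-1] if scores else None
```

Returning None instead of raising lets a batch job count and inspect malformed outputs separately rather than abort on the first one.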

Usage and Optimization

  • HuggingFace Transformers: The model integrates easily, and a minimal example is provided. Using the specific prompt template is critical for optimal performance.
  • vLLM Integration: vLLM is recommended for production environments, where it offers 10-20x faster inference and better batching for large-scale evaluation tasks.
  • Language Preference: While English is supported, formatting reference answers in Chinese yields significantly better results due to the model's training data composition.
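The batched vLLM workflow described above might be sketched as follows. This is an illustration under stated assumptions, not the official example: the model ID is taken from this page's title, `ScoringItem`, `build_prompt`, and `score_batch` are hypothetical names, and `PLACEHOLDER_TEMPLATE` is only a stand-in for the official prompt template, which should be copied from the model card. The sketch also passes raw prompts to `generate`, so no chat template is applied.

```python
from dataclasses import dataclass
from typing import List

MODEL_ID = "AGI-Eval/UNO-Scorer-Qwen3-14B"  # taken from this page's title

# HYPOTHETICAL placeholder, NOT the official prompt template.
# Replace it with the template published on the model card.
PLACEHOLDER_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Reference answer:\n{reference}\n\n"
    "Model response:\n{response}\n"
)


@dataclass
class ScoringItem:
    question: str
    reference: str  # the page notes that Chinese-formatted references score better
    response: str


def build_prompt(item: ScoringItem, template: str = PLACEHOLDER_TEMPLATE) -> str:
    """Fill the template with one question / reference / response triple."""
    return template.format(
        question=item.question,
        reference=item.reference,
        response=item.response,
    )


def score_batch(items: List[ScoringItem], max_tokens: int = 2048) -> List[str]:
    """Generate raw scorer outputs for all items in one batched vLLM call."""
    # Imported lazily so the prompt helpers work without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID)
    # Greedy decoding is an assumption made here for reproducible scores.
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    prompts = [build_prompt(item) for item in items]
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Passing the whole list to a single `generate` call lets vLLM schedule the batch itself, which is where the batching benefit for large evaluation runs comes from.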

This model is ideal for researchers and developers needing an efficient and accurate tool for automated LMM evaluation, especially for complex, multi-part questions.