KbsdJames/Omni-Judge

Hosted on Hugging Face · Text generation · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Sep 13, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

KbsdJames/Omni-Judge is an 8-billion-parameter mathematical evaluation model built on Meta's Llama-3.1-8B-Instruct, designed to automatically assess whether a model-generated solution to a mathematical problem matches a reference answer. It was instruction-tuned on GPT-4o evaluation data and reaches approximately 91% agreement with GPT-4o on a held-out test set, making it an efficient, cost-effective stand-in for GPT-4-as-a-judge on complex mathematical reasoning problems.


Omni-Judge: Automated Mathematical Solution Evaluation

Omni-Judge is an 8 billion parameter model developed by KbsdJames, specifically engineered for the automated evaluation of mathematical solutions. Built on the meta-llama/Llama-3.1-8B-Instruct architecture, it functions as a "GPT-4-as-a-judge" for mathematical problems, offering a more efficient and cost-effective alternative to manual or rule-based assessment.
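The snippet below is a minimal sketch of invoking the judge with the `transformers` library. The prompt layout (question, reference answer, student solution) and the generation settings are assumptions for illustration; the model repository defines the authoritative chat template and usage example.

```python
# Minimal sketch: querying Omni-Judge via transformers.
# The prompt layout below is an assumption for illustration; consult the
# model repository for the authoritative chat template and field names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KbsdJames/Omni-Judge"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The judge receives the problem statement, the reference answer, and
# the candidate (model-generated) solution to be graded.
question = "What is 2 + 2?"
reference_answer = "4"
student_solution = "2 + 2 = 4, so the answer is 4."

# Assumed prompt layout; adapt to the template shipped with the model.
prompt = (
    f"# Question\n{question}\n\n"
    f"# Reference Answer\n{reference_answer}\n\n"
    f"# Student Solution\n{student_solution}\n"
)
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the verdict deterministic across runs.
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
verdict = tokenizer.decode(
    output[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(verdict)
```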

Key Capabilities

  • Automated Mathematical Assessment: Evaluates whether a model-generated solution to a mathematical problem is correct, given the problem statement and a reference answer (see the verdict-parsing sketch after this list).
  • High Agreement with GPT-4o: Instruction-tuned on 17,618 examples of GPT-4o evaluation data, Omni-Judge demonstrates approximately 91% agreement with GPT-4o's judgments on unseen test samples.
  • Efficiency: Designed to provide automated assessment with greater efficiency and lower cost compared to complex rule-based methods or direct use of larger models.
  • Benchmark Application: Applicable to various mathematical reasoning benchmarks, including the proposed Omni-MATH benchmark.
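Because the judge returns a natural-language judgement rather than a bare score, downstream code typically extracts a boolean verdict from the generated text. A hedged sketch follows, assuming the output contains an `Equivalence judgement` field with TRUE/FALSE; the actual output layout may differ, so adjust the pattern accordingly.

```python
# Minimal sketch: extracting a boolean verdict from Omni-Judge output.
# The "Equivalence judgement" field name is an assumption about the
# judge's output format; adjust the pattern to the model's actual layout.
import re

def parse_verdict(judge_output: str) -> bool | None:
    """Return True/False for a parsed judgement, None if unparseable."""
    match = re.search(
        r"Equivalence judgement\s*[:\n]\s*(TRUE|FALSE)",
        judge_output,
        flags=re.IGNORECASE,
    )
    if match is None:
        return None
    return match.group(1).upper() == "TRUE"

print(parse_verdict("## Equivalence judgement\nTRUE"))   # True
print(parse_verdict("## Equivalence judgement\nFALSE"))  # False
```

Returning `None` for unparseable output lets a pipeline flag those items for manual review instead of silently miscounting them.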

Performance Highlights

Omni-Judge exhibits strong consistency in its evaluations. For instance, it achieved 95.79% consistency with GPT-4o judgments when evaluating outputs from Mathstral-7B-v0.1, and 94.01% for DeepSeek-Coder-V2, indicating stable performance across outputs from different models.
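These consistency figures are plain agreement rates: the fraction of evaluated items on which Omni-Judge's verdict matches GPT-4o's. An illustrative computation with hypothetical verdict lists:

```python
# Illustrative agreement-rate computation (hypothetical verdicts).
# agreement = matching items / total items
judge_verdicts = [True, False, True, True, False]
gpt4o_verdicts = [True, False, True, False, False]

matches = sum(j == g for j, g in zip(judge_verdicts, gpt4o_verdicts))
agreement = matches / len(judge_verdicts)
print(f"Agreement: {agreement:.2%}")  # Agreement: 80.00%
```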

Good for

  • Developers needing an automated, scalable solution for evaluating mathematical reasoning capabilities of LLMs.
  • Researchers working on mathematical benchmarks and requiring consistent, cost-effective assessment tools.
  • Integration into pipelines for continuous evaluation of models that generate mathematical solutions.