Omni-Judge: Automated Mathematical Solution Evaluation
Omni-Judge is an 8-billion-parameter model developed by KbsdJames, engineered specifically for the automated evaluation of mathematical solutions. Built on the meta-llama/Llama-3.1-8B-Instruct architecture, it functions as a "GPT-4-as-a-judge" for mathematical problems, offering a more efficient and cost-effective alternative to manual or rule-based assessment.
Key Capabilities
- Automated Mathematical Assessment: Given a problem statement, a reference answer, and a model-generated solution, Omni-Judge evaluates whether the solution is correct (see the usage sketch after this list).
- High Agreement with GPT-4o: Instruction-tuned on 17,618 examples of GPT-4o evaluation data, Omni-Judge demonstrates approximately 91% agreement with GPT-4o's judgments on unseen test samples.
- Efficiency: Provides automated assessment at greater speed and lower cost than complex rule-based methods or direct use of larger models as judges.
- Benchmark Application: Applicable to various mathematical reasoning benchmarks, including the proposed Omni-MATH benchmark.
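A minimal usage sketch with Hugging Face transformers is shown below. The repo id `KbsdJames/Omni-Judge` and the prompt layout inside `judge()` are illustrative assumptions rather than the official template; consult the model card for the exact prompt format the model was instruction-tuned on.

```python
# Minimal sketch: querying Omni-Judge via transformers.
# Assumptions: the repo id and the prompt layout below are illustrative;
# consult the model card for the officially supported template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "KbsdJames/Omni-Judge"  # assumed Hub repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(problem: str, reference_answer: str, candidate_solution: str) -> str:
    """Ask the judge whether a candidate solution matches the reference answer."""
    messages = [{
        "role": "user",
        "content": (
            f"# Problem\n{problem}\n\n"
            f"# Reference Answer\n{reference_answer}\n\n"
            f"# Candidate Solution\n{candidate_solution}\n\n"
            "Judge whether the candidate solution is correct."
        ),
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(judge("Compute 2 + 2.", "4", "2 + 2 = 4, so the answer is 4."))
```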
Performance Highlights
Omni-Judge exhibits strong consistency in its evaluations. For instance, it achieved 95.79% consistency with GPT-4o judgments when evaluating outputs from Mathstral-7B-v0.1, and 94.01% when evaluating outputs from DeepSeek-Coder-V2, demonstrating robust performance across outputs from different models.
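How these agreement figures are computed is not spelled out above; the natural reading is verdict-level agreement, i.e. the fraction of evaluated items on which Omni-Judge and GPT-4o return the same verdict. A minimal sketch of that metric, assuming binary correct/incorrect verdicts:

```python
def agreement_rate(judge_verdicts: list[bool], gpt4o_verdicts: list[bool]) -> float:
    """Fraction of items on which two judges return the same verdict."""
    assert len(judge_verdicts) == len(gpt4o_verdicts), "verdict lists must align"
    matches = sum(a == b for a, b in zip(judge_verdicts, gpt4o_verdicts))
    return matches / len(judge_verdicts)

# e.g. agreement_rate([True, True, False], [True, False, False]) == 2/3
```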
Good for
- Developers needing an automated, scalable solution for evaluating mathematical reasoning capabilities of LLMs.
- Researchers working on mathematical benchmarks and requiring consistent, cost-effective assessment tools.
- Teams integrating continuous evaluation of models that generate mathematical solutions into their pipelines (see the sketch after this list).
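As one illustration of the last point, a batch-evaluation loop might wrap the `judge()` helper from the usage sketch above. The record fields and the verdict-parsing heuristic here are assumptions; a production pipeline should parse whatever structured verdict format the model actually emits.

```python
# Hypothetical continuous-evaluation loop built on the judge() helper above.
records = [
    {"problem": "Compute 2 + 2.", "reference": "4", "solution": "The answer is 4."},
    # ... more (problem, reference answer, candidate solution) triples
]

passed = []
for rec in records:
    raw = judge(rec["problem"], rec["reference"], rec["solution"])
    # Naive parsing: count a verdict as a pass if it says "correct" but not
    # "incorrect". Replace with parsing of the model's real output format.
    text = raw.lower()
    passed.append("correct" in text and "incorrect" not in text)

print(f"Judge-assessed accuracy: {sum(passed) / len(passed):.2%}")
```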