KbsdJames/Omni-Judge

Hosted on Hugging Face · Text generation · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Sep 13, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

KbsdJames/Omni-Judge is an 8-billion-parameter mathematical evaluation model built on Meta's Llama-3.1-8B-Instruct, designed to automatically assess whether a model-generated solution to a mathematical problem matches a reference answer. It was instruction-tuned on GPT-4o evaluation data and reaches approximately 91% agreement with GPT-4o on a held-out test set, making it an efficient, cost-effective stand-in for GPT-4-as-a-judge on complex mathematical reasoning problems.


Omni-Judge: Automated Mathematical Solution Evaluation

Omni-Judge is an 8 billion parameter model developed by KbsdJames, specifically engineered for the automated evaluation of mathematical solutions. Built on the meta-llama/Llama-3.1-8B-Instruct architecture, it functions as a "GPT-4-as-a-judge" for mathematical problems, offering a more efficient and cost-effective alternative to manual or rule-based assessment.
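The snippet below is a minimal sketch of invoking the judge with the `transformers` library. The prompt layout (question, reference answer, student solution) and the generation settings are assumptions for illustration; the model repository defines the authoritative chat template and usage example.

```python
# Minimal sketch: querying Omni-Judge via transformers.
# The prompt layout below is an assumption for illustration; consult the
# model repository for the authoritative chat template and field names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KbsdJames/Omni-Judge"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The judge receives the problem statement, the reference answer, and
# the candidate (model-generated) solution to be graded.
question = "What is 2 + 2?"
reference_answer = "4"
student_solution = "2 + 2 = 4, so the answer is 4."

# Assumed prompt layout; adapt to the template shipped with the model.
prompt = (
    f"# Question\n{question}\n\n"
    f"# Reference Answer\n{reference_answer}\n\n"
    f"# Student Solution\n{student_solution}\n"
)
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the verdict deterministic across runs.
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
verdict = tokenizer.decode(
    output[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(verdict)
```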

Key Capabilities

  • Automated Mathematical Assessment: Evaluates whether a model-generated solution to a mathematical problem is correct, given the problem statement and a reference answer (see the verdict-parsing sketch after this list).
  • High Agreement with GPT-4o: Instruction-tuned on 17,618 examples of GPT-4o evaluation data, Omni-Judge demonstrates approximately 91% agreement with GPT-4o's judgments on unseen test samples.
  • Efficiency: Designed to provide automated assessment with greater efficiency and lower cost compared to complex rule-based methods or direct use of larger models.
  • Benchmark Application: Applicable to various mathematical reasoning benchmarks, including the proposed Omni-MATH benchmark.
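Because the judge returns a natural-language judgement rather than a bare score, downstream code typically extracts a boolean verdict from the generated text. A hedged sketch follows, assuming the output contains an `Equivalence judgement` field with TRUE/FALSE; the actual output layout may differ, so adjust the pattern accordingly.

```python
# Minimal sketch: extracting a boolean verdict from Omni-Judge output.
# The "Equivalence judgement" field name is an assumption about the
# judge's output format; adjust the pattern to the model's actual layout.
import re

def parse_verdict(judge_output: str) -> bool | None:
    """Return True/False for a parsed judgement, None if unparseable."""
    match = re.search(
        r"Equivalence judgement\s*[:\n]\s*(TRUE|FALSE)",
        judge_output,
        flags=re.IGNORECASE,
    )
    if match is None:
        return None
    return match.group(1).upper() == "TRUE"

print(parse_verdict("## Equivalence judgement\nTRUE"))   # True
print(parse_verdict("## Equivalence judgement\nFALSE"))  # False
```

Returning `None` for unparseable output lets a pipeline flag those items for manual review instead of silently miscounting them.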

Performance Highlights

Omni-Judge exhibits strong consistency in its evaluations. For instance, it achieved 95.79% consistency with GPT-4o judgments when evaluating outputs from Mathstral-7B-v0.1, and 94.01% for DeepSeek-Coder-V2, indicating stable performance across outputs from different models.
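These consistency figures are plain agreement rates: the fraction of evaluated items on which Omni-Judge's verdict matches GPT-4o's. An illustrative computation with hypothetical verdict lists:

```python
# Illustrative agreement-rate computation (hypothetical verdicts).
# agreement = matching items / total items
judge_verdicts = [True, False, True, True, False]
gpt4o_verdicts = [True, False, True, False, False]

matches = sum(j == g for j, g in zip(judge_verdicts, gpt4o_verdicts))
agreement = matches / len(judge_verdicts)
print(f"Agreement: {agreement:.2%}")  # Agreement: 80.00%
```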

Good for

  • Developers needing an automated, scalable solution for evaluating mathematical reasoning capabilities of LLMs.
  • Researchers working on mathematical benchmarks and requiring consistent, cost-effective assessment tools.
  • Integration into pipelines for continuous evaluation of models that generate mathematical solutions.