opencompass/CompassJudger-1-32B-Instruct
opencompass/CompassJudger-1-32B-Instruct is a 32.8-billion-parameter instruction-tuned language model developed by the OpenCompass team, designed as an all-in-one judge model. It supports multiple evaluation methods, including scoring and pairwise comparison, and can output detailed assessment reviews in user-specified formats. Beyond evaluation, it also handles general instruction-following tasks, making it useful both as a judge and as a general-purpose LLM.
CompassJudger-1: All-in-one Judge Model
CompassJudger-1, developed by the OpenCompass team, is a 32.8-billion-parameter instruction-tuned model designed primarily for comprehensive evaluation tasks. It functions as an "all-in-one" judge, capable of performing various assessment methods and generating detailed reviews in specified formats.
Key Capabilities:
- Comprehensive Evaluation: Supports multiple evaluation methods, including scoring, pairwise comparison, and detailed assessment feedback (see the usage sketch after this list).
- Formatted Output: Can output evaluation results in a specific, user-defined format for easier analysis.
- Versatility: Beyond its judging capabilities, it can also handle general instruction-following tasks, similar to a standard instruction model.
- Inference Acceleration: Supports accelerated inference backends such as vLLM and LMDeploy.
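As a concrete illustration, here is a minimal sketch of prompting the model as a pointwise judge via Hugging Face transformers. The judge prompt wording and the JSON scoring format are illustrative assumptions, not the official template; consult the model card for the recommended judge prompts.

```python
# Minimal sketch: using CompassJudger-1 as a pointwise judge via transformers.
# The judge prompt below is an illustrative assumption, not the official template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-1-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Hypothetical judging prompt: ask for a 1-10 score plus a short rationale.
prompt = (
    "You are a strict evaluator. Rate the following response to the question "
    "on a scale of 1-10 and briefly justify your score.\n\n"
    "Question: What causes tides on Earth?\n"
    "Response: Tides are caused mainly by the Moon's gravitational pull.\n\n"
    'Output format: {"score": <int>, "reason": "<one sentence>"}'
)

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
review = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(review)  # e.g. a JSON-formatted score and rationale
```

Because the model is format-aware, constraining the output to JSON as above tends to make downstream parsing straightforward; for large batches, the same chat messages can be served through vLLM or LMDeploy instead.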
JudgerBench and Subjective Evaluation:
OpenCompass has introduced JudgerBench, a benchmark that standardizes the evaluation of judge models. CompassJudger-1 is integral to this benchmark, helping to identify more effective evaluator models. Users can test their own judge models on JudgerBench with the provided OpenCompass scripts and contribute results to the leaderboard; a config sketch follows.
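The snippet below is a hedged sketch of registering a model in an OpenCompass-style Python config for a JudgerBench run. The class name `HuggingFacewithChatTemplate` and the dict keys follow common OpenCompass config conventions, but the exact JudgerBench config file and fields should be verified against the OpenCompass repository.

```python
# Sketch of an OpenCompass-style model config entry for a JudgerBench run.
# Field names follow common OpenCompass conventions; verify against the
# actual JudgerBench config in the OpenCompass repository.
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr="CompassJudger-1-32B-Instruct",
        path="opencompass/CompassJudger-1-32B-Instruct",
        max_out_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=2),
    )
]
```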
Additionally, CompassJudger-1 can be used within the OpenCompass framework to evaluate common subjective datasets, such as AlignBench, by configuring it as the judge model for other LLMs, as sketched below. A separate subjective-evaluation leaderboard tracks its performance in this role.
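For subjective evaluation, OpenCompass configs typically declare the judge separately from the models under test. The following sketch rests on that assumption: the `judge_models` variable name mirrors OpenCompass subjective-eval configs, while the dataset, inferencer, and summarizer settings are omitted and should be taken from the framework's own examples.

```python
# Sketch: reusing the same model entry as the judge in a subjective-eval
# config such as AlignBench. Only the judge declaration is shown; dataset
# and summarizer settings come from OpenCompass's subjective-eval examples.
from opencompass.models import HuggingFacewithChatTemplate

judge_models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr="CompassJudger-1-32B-Instruct",
        path="opencompass/CompassJudger-1-32B-Instruct",
        max_out_len=2048,
        run_cfg=dict(num_gpus=2),
    )
]
```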
Good for:
- Automated evaluation of LLM responses through scoring and comparison.
- Generating structured, detailed feedback for model assessment.
- General instruction-following tasks where a versatile model with strong judging capabilities is beneficial.
- Developers looking to integrate a robust, format-aware judge into their evaluation pipelines.