Skywork-Critic-Llama-3.1-70B is a 70-billion-parameter judge model developed by SkyworkAI, built upon Meta's Llama-3.1-70B-Instruct. It specializes in pairwise preference evaluation, offering nuanced judgments on the relative quality or suitability of a pair of candidate responses. The model excels at data improvement, evaluation, and reward modeling, and achieves the top rank on the RewardBench leaderboard for generative models across all sizes.
Overview
Skywork-Critic-Llama-3.1-70B is fine-tuned from Meta's Llama-3.1-70B-Instruct for advanced pairwise preference evaluation: given an instruction and two candidate responses, it delivers a detailed judgment on their relative quality and suitability. Training leverages a diverse set of high-quality datasets, including cleaned open-source data such as HelpSteer2 and the Magpie DPO series, in-house human annotations, and synthetic critic data generated using a self-taught approach.
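In practice, a pairwise judgment is elicited by packing the instruction and both candidate responses into a single judge prompt. The template below is a sketch in the style of the widely used MT-Bench judge prompt; the exact wording published with the model may differ, so treat it as illustrative rather than official.

```python
# Illustrative MT-Bench-style pairwise judge prompt. The official
# Skywork-Critic template may differ; this only shows the general shape.
JUDGE_PROMPT = """Please act as an impartial judge and evaluate the quality of the
responses provided by two AI assistants to the user question displayed below.
Avoid position and length biases, and be as objective as possible.
Output your final verdict strictly in this format: "[[A]]" if assistant A is
better, "[[B]]" if assistant B is better.

[User Question]
{question}

[The Start of Assistant A's Answer]
{response_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{response_b}
[The End of Assistant B's Answer]"""
```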
Key Capabilities
- Pairwise Preference Evaluation: Compares two candidate responses to the same instruction and judges which is better, the core operation behind data improvement and reward modeling (see the inference sketch after this list).
- High Performance: Achieves the top rank on the RewardBench leaderboard for generative models across all sizes, with an overall score of 93.3.
- Detailed Judgment: Can act as a judge that generates scores and written rationales for instruction-response pairs, not just a bare preference verdict.
- Data Selection: Functions as a preference data selector, distinguishing chosen from rejected responses when building Direct Preference Optimization (DPO) training data.
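To make the pairwise evaluation concrete, the sketch below runs the judge with Hugging Face transformers and parses its verdict, reusing the JUDGE_PROMPT template from the Overview. The repository id and generation settings are assumptions, not values taken from this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Skywork/Skywork-Critic-Llama-3.1-70B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, response_a: str, response_b: str) -> str:
    """Return "A" or "B" according to the model's pairwise verdict."""
    user_message = JUDGE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # Greedy decoding: the verdict should be deterministic for evaluation.
    output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    completion = tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
    return "A" if "[[A]]" in completion else "B"
```

Judge models of this kind are known to exhibit position bias; querying both orderings of the two responses and keeping only consistent verdicts is a common mitigation.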
Training Details
The model was instruction-tuned using a combination of:
- Open-source datasets: Including subsets of HelpSteer2, OffsetBias, WildGuard, and various Magpie DPO series datasets, along with critic datasets like Open-Critic-GPT.
- In-house human annotation data: Pointwise scoring and pairwise comparisons, including rationales.
- Synthetic critic data: Generated by creating similar instructions or introducing subtle errors into high-quality responses (a sketch of this recipe follows the list).
- Critic-related chat data: To maintain conversational capabilities.
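The "subtle errors" recipe in the synthetic-data bullet can be pictured as a small pipeline: start from a high-quality instruction-response pair, ask a generator model to corrupt the response in a targeted way, and keep the original and corrupted versions as a chosen/rejected pair. The helper below is a hypothetical illustration of that idea, not Skywork's actual pipeline; `generate` is a placeholder for whatever LLM call produces the perturbation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    instruction: str
    chosen: str    # the original high-quality response
    rejected: str  # the same response with a subtle injected error

# Hypothetical corruption prompt; the actual instructions used to build
# Skywork's synthetic critic data are not published on this page.
CORRUPTION_PROMPT = (
    "Rewrite the following answer so that it keeps the same style and length "
    "but contains one subtle factual or logical error:\n\n{response}"
)

def make_synthetic_pair(
    instruction: str, good_response: str, generate: Callable[[str], str]
) -> PreferencePair:
    """Build a chosen/rejected pair by injecting a subtle error into a good response."""
    corrupted = generate(CORRUPTION_PROMPT.format(response=good_response))
    return PreferencePair(instruction, chosen=good_response, rejected=corrupted)
```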
Use Cases
- AI Model Evaluation: Objectively assess the quality of responses from other generative models.
- Reward Modeling: Generate preference data for training reward models in reinforcement learning from human feedback (RLHF).
- Data Curation: Select high-quality data for fine-tuning and improving other language models, for example by orienting raw response pairs into DPO records as sketched below.
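As a concrete example of the data-curation use case, the judge's verdict can orient raw response pairs into DPO training records: whichever response the judge prefers becomes the chosen one. A minimal sketch, reusing the `judge` helper defined above:

```python
def select_dpo_records(samples):
    """Orient (question, response_a, response_b) triples into DPO records.

    Uses the `judge` helper defined earlier; when the judge prefers B, the
    pair is swapped so that `chosen` always holds the preferred response.
    """
    records = []
    for question, response_a, response_b in samples:
        if judge(question, response_a, response_b) == "A":
            chosen, rejected = response_a, response_b
        else:
            chosen, rejected = response_b, response_a
        records.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return records
```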