Skywork-Critic-Llama-3.1-70B is a 70-billion-parameter judge model developed by SkyworkAI, built upon Meta's Llama-3.1-70B-Instruct. It specializes in pairwise preference evaluation, offering nuanced judgments on the relative quality or suitability of a pair of candidate responses. The model excels at data improvement, evaluation, and reward modeling, and achieves the top rank on the RewardBench leaderboard for generative models across all sizes.
Overview
Skywork-Critic-Llama-3.1-70B is fine-tuned from Meta's Llama-3.1-70B-Instruct for advanced pairwise preference evaluation: given an instruction and two candidate responses, it delivers a detailed judgment on their relative quality and suitability. Training leverages a diverse set of high-quality datasets, including cleaned open-source data such as HelpSteer2 and the Magpie DPO series, in-house human annotations, and synthetic critic data generated using a self-taught approach.
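In practice, a pairwise judgment is elicited by packing the instruction and both candidate responses into a single judge prompt. The template below is a sketch in the style of the widely used MT-Bench judge prompt; the exact wording published with the model may differ, so treat it as illustrative rather than official.

```python
# Illustrative MT-Bench-style pairwise judge prompt. The official
# Skywork-Critic template may differ; this only shows the general shape.
JUDGE_PROMPT = """Please act as an impartial judge and evaluate the quality of the
responses provided by two AI assistants to the user question displayed below.
Avoid position and length biases, and be as objective as possible.
Output your final verdict strictly in this format: "[[A]]" if assistant A is
better, "[[B]]" if assistant B is better.

[User Question]
{question}

[The Start of Assistant A's Answer]
{response_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{response_b}
[The End of Assistant B's Answer]"""
```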
Key Capabilities
- Pairwise Preference Evaluation: Compares two candidate responses to the same instruction and judges which is better, the core operation behind data improvement and reward modeling (see the inference sketch after this list).
- High Performance: Achieves the top rank on the RewardBench leaderboard for generative models across all sizes, with an overall score of 93.3.
- Detailed Judgment: Can act as a judge that generates scores and written rationales for instruction-response pairs, not just a bare preference verdict.
- Data Selection: Functions as a preference data selector, distinguishing chosen from rejected responses when building Direct Preference Optimization (DPO) training data.
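To make the pairwise evaluation concrete, the sketch below runs the judge with Hugging Face transformers and parses its verdict, reusing the JUDGE_PROMPT template from the Overview. The repository id and generation settings are assumptions, not values taken from this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Skywork/Skywork-Critic-Llama-3.1-70B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, response_a: str, response_b: str) -> str:
    """Return "A" or "B" according to the model's pairwise verdict."""
    user_message = JUDGE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # Greedy decoding: the verdict should be deterministic for evaluation.
    output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    completion = tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
    return "A" if "[[A]]" in completion else "B"
```

Judge models of this kind are known to exhibit position bias; querying both orderings of the two responses and keeping only consistent verdicts is a common mitigation.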
Training Details
The model was instruction-tuned using a combination of:
- Open-source datasets: Including subsets of HelpSteer2, OffsetBias, WildGuard, and various Magpie DPO series datasets, along with critic datasets like Open-Critic-GPT.
- In-house human annotation data: Pointwise scoring and pairwise comparisons, including rationales.
- Synthetic critic data: Generated by creating similar instructions or introducing subtle errors into high-quality responses (a sketch of this recipe follows the list).
- Critic-related chat data: To maintain conversational capabilities.
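The "subtle errors" recipe in the synthetic-data bullet can be pictured as a small pipeline: start from a high-quality instruction-response pair, ask a generator model to corrupt the response in a targeted way, and keep the original and corrupted versions as a chosen/rejected pair. The helper below is a hypothetical illustration of that idea, not Skywork's actual pipeline; `generate` is a placeholder for whatever LLM call produces the perturbation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    instruction: str
    chosen: str    # the original high-quality response
    rejected: str  # the same response with a subtle injected error

# Hypothetical corruption prompt; the actual instructions used to build
# Skywork's synthetic critic data are not published on this page.
CORRUPTION_PROMPT = (
    "Rewrite the following answer so that it keeps the same style and length "
    "but contains one subtle factual or logical error:\n\n{response}"
)

def make_synthetic_pair(
    instruction: str, good_response: str, generate: Callable[[str], str]
) -> PreferencePair:
    """Build a chosen/rejected pair by injecting a subtle error into a good response."""
    corrupted = generate(CORRUPTION_PROMPT.format(response=good_response))
    return PreferencePair(instruction, chosen=good_response, rejected=corrupted)
```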
Use Cases
- AI Model Evaluation: Objectively assess the quality of responses from other generative models.
- Reward Modeling: Generate preference data for training reward models in reinforcement learning from human feedback (RLHF).
- Data Curation: Select high-quality data for fine-tuning and improving other language models, for example by orienting raw response pairs into DPO records as sketched below.
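As a concrete example of the data-curation use case, the judge's verdict can orient raw response pairs into DPO training records: whichever response the judge prefers becomes the chosen one. A minimal sketch, reusing the `judge` helper defined above:

```python
def select_dpo_records(samples):
    """Orient (question, response_a, response_b) triples into DPO records.

    Uses the `judge` helper defined earlier; when the judge prefers B, the
    pair is swapped so that `chosen` always holds the preferred response.
    """
    records = []
    for question, response_a, response_b in samples:
        if judge(question, response_a, response_b) == "A":
            chosen, rejected = response_a, response_b
        else:
            chosen, rejected = response_b, response_a
        records.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return records
```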