Model Overview
UCLA-AGI/Mistral7B-PairRM-SPPO is a 7 billion parameter language model developed by UCLA-AGI, building upon the mistralai/Mistral-7B-Instruct-v0.2 architecture. Its core innovation lies in its training methodology: Self-Play Preference Optimization (SPPO), as detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
Key Capabilities & Training
- Self-Play Preference Optimization (SPPO): The model was fine-tuned with an iterative SPPO approach, leveraging synthetic responses from the openbmb/UltraFeedback dataset, processed via snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset.
- Enhanced Probability Estimation: During training, three samples (winner, loser, and a random sample) are used to estimate soft preference probabilities, which has been shown to deliver better performance on AlpacaEval 2.0 than the method described in the original paper.
- Synthetic Data Training: All responses used for training are synthetic, indicating a focus on learning from generated, high-quality preference data.
- Performance: On the AlpacaEval leaderboard, this model achieves a 32.14% win rate and a 30.46% length-controlled (LC) win rate.
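The three-sample soft-probability idea above can be sketched in a few lines. This is an illustrative approximation, not the authors' released training code: the pairwise scores stand in for PairRM reward-model logits, the Bradley-Terry sigmoid conversion and the `eta` hyperparameter are assumptions, and the squared loss only mirrors the general SPPO objective of pushing the policy's log-probability ratio toward a scaled, centered win probability.

```python
import math

def win_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style soft probability that response A beats response B,
    computed from pairwise reward scores (stand-ins for PairRM logits)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def soft_prob_estimate(target_score: float, other_scores: list[float]) -> float:
    """Estimate P(target wins) by averaging pairwise win probabilities against
    the other sampled responses (e.g. winner, loser, and a random sample)."""
    return sum(win_probability(target_score, s) for s in other_scores) / len(other_scores)

def sppo_square_loss(log_ratio: float, p_win: float, eta: float = 1.0) -> float:
    """SPPO-style squared loss: push the policy/reference log-probability ratio
    toward eta * (P(win) - 1/2). `eta` is a hypothetical hyperparameter here."""
    return (log_ratio - eta * (p_win - 0.5)) ** 2

# Toy scores for three sampled responses: winner, loser, random
scores = {"winner": 2.0, "loser": -1.0, "random": 0.3}
p = soft_prob_estimate(scores["winner"], [scores["loser"], scores["random"]])
loss = sppo_square_loss(log_ratio=0.0, p_win=p)
```

Averaging over three samples rather than using only the winner/loser pair gives a smoother estimate of each response's win probability, which is the refinement the bullet above credits for the AlpacaEval 2.0 gains.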
Important Note
The authors recommend using the original checkpoint, UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3, as reported in their paper, which they expect to perform more consistently across evaluation tasks.