Model Overview
UCLA-AGI/Mistral7B-PairRM-SPPO is a 7 billion parameter language model developed by UCLA-AGI, building upon the mistralai/Mistral-7B-Instruct-v0.2 architecture. Its core innovation lies in its training methodology: Self-Play Preference Optimization (SPPO), as detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
Key Capabilities & Training
- Self-Play Preference Optimization (SPPO): The model was fine-tuned with an iterative SPPO approach, leveraging synthetic responses from the openbmb/UltraFeedback dataset, processed via snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset.
- Enhanced Probability Estimation: During training, three samples (winner, loser, and a random sample) are used to estimate soft preference probabilities, which has been shown to deliver better performance on AlpacaEval 2.0 than the method described in the original paper.
- Synthetic Data Training: All responses used for training are synthetic, indicating a focus on learning from generated, high-quality preference data.
- Performance: On the AlpacaEval leaderboard, this model achieves a 32.14% win rate and a 30.46% length-controlled (LC) win rate.
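The three-sample soft-probability idea above can be sketched in a few lines. This is an illustrative approximation, not the authors' released training code: the pairwise scores stand in for PairRM reward-model logits, the Bradley-Terry sigmoid conversion and the `eta` hyperparameter are assumptions, and the squared loss only mirrors the general SPPO objective of pushing the policy's log-probability ratio toward a scaled, centered win probability.

```python
import math

def win_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style soft probability that response A beats response B,
    computed from pairwise reward scores (stand-ins for PairRM logits)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def soft_prob_estimate(target_score: float, other_scores: list[float]) -> float:
    """Estimate P(target wins) by averaging pairwise win probabilities against
    the other sampled responses (e.g. winner, loser, and a random sample)."""
    return sum(win_probability(target_score, s) for s in other_scores) / len(other_scores)

def sppo_square_loss(log_ratio: float, p_win: float, eta: float = 1.0) -> float:
    """SPPO-style squared loss: push the policy/reference log-probability ratio
    toward eta * (P(win) - 1/2). `eta` is a hypothetical hyperparameter here."""
    return (log_ratio - eta * (p_win - 0.5)) ** 2

# Toy scores for three sampled responses: winner, loser, random
scores = {"winner": 2.0, "loser": -1.0, "random": 0.3}
p = soft_prob_estimate(scores["winner"], [scores["loser"], scores["random"]])
loss = sppo_square_loss(log_ratio=0.0, p_win=p)
```

Averaging over three samples rather than using only the winner/loser pair gives a smoother estimate of each response's win probability, which is the refinement the bullet above credits for the AlpacaEval 2.0 gains.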
Important Note
The authors recommend using the original checkpoint, UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3, as reported in their paper, which they expect to perform more consistently across evaluation tasks.