Overview
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1 is a 7-billion-parameter language model developed by UCLA-AGI, representing the first iteration of a model fine-tuned with Self-Play Preference Optimization (SPPO). It is fine-tuned from mistralai/Mistral-7B-Instruct-v0.2 and trained on synthetic responses generated from prompts in the openbmb/UltraFeedback dataset. The methodology is detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
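Since the checkpoint is distributed in the standard Hugging Face format, it can presumably be loaded with the `transformers` library like any other Mistral-7B-Instruct derivative. A minimal sketch (the `[INST]` chat template follows the Mistral-7B-Instruct-v0.2 convention; generation settings are illustrative, not recommendations from the model card):

```python
def format_prompt(user_message: str) -> str:
    # Single-turn chat template in the Mistral-7B-Instruct-v0.2 style.
    return f"<s>[INST] {user_message} [/INST]"


def generate(user_message: str, max_new_tokens: int = 256) -> str:
    # Heavy dependencies are imported lazily so format_prompt stays standalone.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # The template already contains <s>, so skip the automatic BOS token.
    inputs = tokenizer(
        format_prompt(user_message), return_tensors="pt", add_special_tokens=False
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


# Example (requires the model weights and a suitable GPU):
# print(generate("Explain self-play preference optimization in one sentence."))
```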
Key Characteristics
- Self-Play Preference Optimization (SPPO): This model is a direct result of applying the SPPO technique, generating 5 responses per prompt (K=5) in each iteration to refine its alignment.
- Synthetic Data Training: Fine-tuned exclusively on synthetic datasets, demonstrating an alternative approach to preference optimization.
- Iterative Development: This is the first of several iterative models, with subsequent iterations (Iter2, Iter3) showing progressive improvements in evaluation metrics.
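The self-play loop behind these characteristics can be sketched in simplified Python. Everything below is illustrative: `generate_responses` and `preference_score` are stand-ins for the policy model and the PairRM preference ranker, and the per-response win-rate estimate is a simplification of the soft labels the SPPO objective optimizes in the paper.

```python
import random


def generate_responses(prompt: str, k: int = 5) -> list[str]:
    # Stand-in for sampling K candidate responses from the current policy.
    return [f"{prompt} -> candidate {i}" for i in range(k)]


def preference_score(a: str, b: str) -> float:
    # Stand-in for PairRM: estimated probability that response `a` beats `b`.
    rng = random.Random(hash((a, b)))
    return rng.random()


def estimate_win_rates(responses: list[str]) -> list[float]:
    # SPPO scores each response by its average win probability against the
    # other K-1 candidates; these soft labels drive the policy update.
    k = len(responses)
    rates = []
    for i, r in enumerate(responses):
        wins = [preference_score(r, o) for j, o in enumerate(responses) if j != i]
        rates.append(sum(wins) / (k - 1))
    return rates


prompt = "Summarize the benefits of synthetic preference data."
responses = generate_responses(prompt, k=5)
win_rates = estimate_win_rates(responses)
best = responses[win_rates.index(max(win_rates))]
```

In the real pipeline the win rates are not used to pick a single winner but to weight a regression-style loss on the policy's log-probabilities; this sketch only shows where the K=5 candidates and pairwise preference scores come from.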
Evaluation Highlights
The model's performance is benchmarked on several standard evaluations:
- AlpacaEval: Achieved a Win Rate of 23.51% and a length-controlled (LC) Win Rate of 24.79% for Iteration 1.
- Open LLM Leaderboard: Scores include 65.02 on arc_challenge, 69.4 on truthfulqa_mc2, and an average of 66.67.
- MT-Bench: Recorded an average score of 7.21.
Use Cases
This model is particularly relevant for researchers and developers interested in:
- Exploring advanced alignment techniques like Self-Play Preference Optimization.
- Understanding the impact of synthetic data in fine-tuning large language models.
- Benchmarking and comparing iterative improvements in preference-optimized models.