UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1 is a 7-billion-parameter instruction-tuned language model developed by UCLA-AGI, produced by the first iteration of Self-Play Preference Optimization (SPPO) fine-tuning. Built on Mistral-7B-Instruct-v0.2, the model is aligned on synthetic preference data derived from UltraFeedback prompts and is intended to demonstrate the effectiveness of the SPPO method for language model alignment, as detailed in the associated research paper.
Overview
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1 is a 7-billion-parameter language model developed by UCLA-AGI, representing the first iteration of fine-tuning with Self-Play Preference Optimization (SPPO). The model is built on mistralai/Mistral-7B-Instruct-v0.2 and is aligned using synthetic responses generated from prompts in the openbmb/UltraFeedback dataset, with preferences supplied by the PairRM ranking model. The methodology is detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
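Because the model inherits the Mistral-7B-Instruct-v0.2 chat template, it can be loaded through the standard Hugging Face transformers API. The snippet below is a minimal usage sketch; the dtype, device placement, and sampling settings are illustrative assumptions, not values prescribed by this model card.

```python
# Minimal usage sketch assuming the standard transformers API; the dtype,
# device placement, and sampling settings are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The model inherits the Mistral-7B-Instruct-v0.2 chat template.
messages = [{"role": "user", "content": "Explain SPPO in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```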
Key Characteristics
- Self-Play Preference Optimization (SPPO): This model is a direct result of applying the SPPO technique, which samples K=5 responses per prompt at each iteration and ranks them with the PairRM preference model to refine alignment (see the sketch after this list).
- Synthetic Data Training: Fine-tuned exclusively on synthetic datasets, demonstrating that preference optimization can proceed without human-annotated preference labels.
- Iterative Development: This is the first model in the series; the subsequent iterations (Iter2, Iter3) show progressive improvements on the evaluation metrics below.
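The generation-and-ranking step behind SPPO can be illustrated with a short sketch. The code below assumes the llm-blender package (which hosts the PairRM ranker) and uses placeholder candidate strings where the real pipeline would sample K responses from the current policy; it is a conceptual illustration, not the authors' training code.

```python
# Conceptual sketch of one SPPO data-generation step: for each prompt, sample
# K candidate responses from the current policy, then rank them with PairRM.
# Assumes the llm-blender package; this is not the authors' actual pipeline.
import llm_blender

K = 5  # responses per prompt per iteration, matching the K=5 above

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # the pairwise preference model

def rank_candidates(prompt: str, candidates: list[str]) -> list[int]:
    """Return PairRM ranks (1 = most preferred) for one prompt's candidates."""
    ranks = blender.rank([prompt], [candidates], return_scores=False, batch_size=1)
    return ranks[0].tolist()

# Placeholder candidates; in SPPO these come from sampling the policy K times.
prompt = "What are the benefits of unit testing?"
candidates = [f"(sampled response {i})" for i in range(K)]
print(rank_candidates(prompt, candidates))
```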
Evaluation Highlights
The model's performance is benchmarked across several leaderboards; a sketch of reproducing leaderboard-style scores follows this list.
- AlpacaEval 2.0: a win rate of 23.51% and a length-controlled (LC) win rate of 24.79% for Iteration 1.
- Open LLM Leaderboard: 65.02 on arc_challenge, 69.4 on truthfulqa_mc2, and an average score of 66.67.
- MT-Bench: an average score of 7.21.
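For context, Open LLM Leaderboard-style numbers can be re-run with the lm-evaluation-harness. The sketch below assumes its v0.4+ Python API; the tasks run with their default few-shot settings here, which may differ from the leaderboard configuration, so scores may not exactly match the figures above.

```python
# Hedged sketch of re-running two Open LLM Leaderboard tasks with the
# lm-evaluation-harness (v0.4+) Python API; task defaults may differ from
# the leaderboard's few-shot configuration, so exact scores can vary.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1,dtype=bfloat16",
    tasks=["arc_challenge", "truthfulqa_mc2"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```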
Use Cases
This model is particularly relevant for researchers and developers interested in:
- Exploring advanced alignment techniques like Self-Play Preference Optimization.
- Understanding the impact of synthetic data in fine-tuning large language models.
- Benchmarking and comparing iterative improvements in preference-optimized models.