Overview
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2 is a 7-billion-parameter language model developed by UCLA-AGI, fine-tuned from mistralai/Mistral-7B-Instruct-v0.2. It is the second iteration in a series that applies Self-Play Preference Optimization (SPPO) for alignment, as described in the paper "Self-Play Preference Optimization for Language Model Alignment." It was fine-tuned on synthetic responses generated from prompts in the openbmb/UltraFeedback dataset, specifically using the split provided by snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset.
Key Capabilities & Differentiators
- Self-Play Preference Optimization (SPPO): Leverages an iterative self-play mechanism to improve alignment and response quality, generating five candidate responses per prompt in each iteration (K=5); a simplified sketch of the update objective follows this list.
- Synthetic Data Training: Aligned exclusively with synthetic data, demonstrating the effectiveness of this approach for preference optimization.
- Improved Alignment: Shows progressive improvements in alignment metrics across iterations, with Iteration 2 achieving a 27.62% Win Rate on AlpacaEval and an average MT-Bench score of 7.49.
- Mistral-7B Base: Benefits from the strong foundational capabilities of the Mistral-7B-Instruct-v0.2 model.
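To make the self-play mechanism above more concrete: at each iteration the current policy samples K=5 responses per prompt, a pairwise preference model (PairRM, per the model name) estimates each response's probability of beating the others, and the policy is regressed toward those estimates. The sketch below is a minimal, illustrative rendering of the squared-error objective described in the SPPO paper; the function signature, the eta value, and the way win probabilities are supplied are simplifying assumptions, not the authors' released training code.

```python
import torch

def sppo_loss(policy_logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor,
              win_prob: torch.Tensor,
              eta: float) -> torch.Tensor:
    """Illustrative SPPO-style objective for one prompt's K sampled responses.

    policy_logprobs: log pi_theta(y|x) for each of the K responses
    ref_logprobs:    log pi_t(y|x) under the previous iteration's policy
    win_prob:        estimated probability that y beats the current policy,
                     e.g. the average pairwise win probability of y against
                     the other K-1 responses as scored by a preference model
                     such as PairRM
    """
    log_ratio = policy_logprobs - ref_logprobs
    target = eta * (win_prob - 0.5)
    # Squared-error regression of the log-probability ratio onto the
    # scaled, centered win-probability estimate.
    return ((log_ratio - target) ** 2).mean()

# Toy call with K=5 responses for a single prompt (random numbers stand in
# for real log-probabilities and preference scores).
K = 5
loss = sppo_loss(torch.randn(K), torch.randn(K), torch.rand(K), eta=1.0)
print(loss.item())
```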
Evaluation Highlights
- AlpacaEval: Achieved a 27.62% Win Rate (32.12% with best-of-16 sampling) on AlpacaEval, indicating strong performance in instruction following and helpfulness.
- MT-Bench: Scored an average of 7.49, reflecting good conversational abilities.
- Open LLM Leaderboard: Maintained competitive performance across various academic benchmarks, with an average score of 66.75.
When to Use This Model
This model suits applications that need a 7B-parameter model with enhanced alignment and high-quality, instruction-following responses, particularly where alignment driven entirely by synthetic data is of interest. As the second iteration in the SPPO series, it also offers a reference point for how alignment quality progresses across self-play iterations.
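Because the checkpoint is a fine-tune of Mistral-7B-Instruct-v0.2, it should load with the standard Hugging Face transformers API and the Mistral-Instruct chat template. The snippet below is a minimal usage sketch; the generation settings are illustrative defaults rather than recommendations from the model authors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # illustrative; choose a dtype your hardware supports
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Summarize self-play preference optimization in two sentences."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```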