Overview
UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3 is a 9-billion-parameter instruction-tuned language model built on google/gemma-2-9b-it. Developed by UCLA-AGI, it is the third iteration of fine-tuning with Self-Play Preference Optimization (SPPO), a method detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675). Training used synthetic datasets derived from the openbmb/UltraFeedback dataset.
Key Capabilities & Features
- Self-Play Preference Optimization: Leverages an advanced alignment technique to refine model responses.
- Iterative Improvement: This is the third iteration, showing progressive enhancements in alignment and performance.
- Synthetic Data Training: Fine-tuned exclusively on synthetic prompts and responses, which can lead to distinct response characteristics.
- Gemma-2-9B-It Base: Benefits from the foundational capabilities of the Gemma-2-9B-It model.
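At its core, SPPO fits each new iteration by regressing the log-probability ratio between the updated and previous policies toward a scaled, centered win probability. The sketch below is a minimal per-example version of the squared loss from the paper; the function and variable names are ours, and in practice the loss is averaged over batches of sampled responses.

```python
def sppo_example_loss(logp_theta: float, logp_t: float,
                      win_prob: float, eta: float) -> float:
    """Per-example SPPO squared loss (illustrative sketch).

    logp_theta: log-probability of the response under the model being trained
    logp_t:     log-probability under the previous iteration's frozen model
    win_prob:   estimated probability that this response beats the current policy
    eta:        step-size hyperparameter from the paper
    """
    log_ratio = logp_theta - logp_t          # how far the new policy has moved
    target = eta * (win_prob - 0.5)          # centered, scaled preference signal
    return (log_ratio - target) ** 2         # squared regression loss
```

Intuitively, responses estimated to win more than half the time (win_prob > 0.5) pull the policy toward higher probability on them, and losing responses push it away.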
Performance Highlights
Evaluations on the AlpacaEval Leaderboard demonstrate a consistent improvement across SPPO iterations:
- Iter3 Win Rate: Achieves a 53.27% length-controlled (LC) win rate and a 47.74% raw win rate, outperforming Iter1 and Iter2.
Use Cases
This model is suitable for applications requiring a 9B parameter instruction-tuned model with enhanced alignment, particularly where response quality and adherence to preferences are critical. Its iterative SPPO training suggests it may excel in generating more preferred and aligned outputs compared to its base model or earlier iterations.
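For such applications, the model can be used like any other Gemma-2 chat checkpoint. The sketch below shows the single-turn prompt layout (Gemma's `<start_of_turn>` markers, normally produced by the tokenizer's chat template) plus a hedged loading helper; the function names are ours, and running generation requires downloading the full model weights.

```python
def format_gemma_prompt(user_message: str) -> str:
    # Gemma-2 single-turn chat layout; in practice prefer
    # tokenizer.apply_chat_template, which produces this for you.
    return (
        "<start_of_turn>user\n" + user_message + "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate one reply; needs GPU memory and the ~18 GB model download."""
    # Imported lazily so the prompt helper above works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Because the checkpoint keeps the base model's chat template, it is a drop-in replacement wherever google/gemma-2-9b-it is already deployed.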