Overview
UCLA-AGI/Gemma-2-9B-It-SPPO-Iter1 is a 9-billion-parameter language model: the first iteration of google/gemma-2-9b-it fine-tuned with Self-Play Preference Optimization (SPPO). Developed by UCLA-AGI, the model is aligned on synthetic preference data derived from the UltraFeedback prompt set. Training ran for a single epoch with a learning rate of 5e-07 and the RMSProp optimizer.
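The SPPO update mentioned above can be sketched as a per-response square loss that regresses the policy's log-probability ratio toward a scaled preference estimate. This is one reading of the objective in arXiv:2405.00675; the function name, the scale `eta`, and all numbers below are purely illustrative, not the authors' exact training code.

```python
def sppo_loss(logp_theta: float, logp_ref: float, p_win: float, eta: float = 1.0) -> float:
    """Illustrative SPPO square loss for one (prompt, response) pair.

    Pushes log(pi_theta / pi_ref) toward eta * (p_win - 1/2), where p_win
    estimates the probability that this response beats the current policy's
    other responses for the same prompt.
    """
    log_ratio = logp_theta - logp_ref
    target = eta * (p_win - 0.5)
    return (log_ratio - target) ** 2

# Made-up numbers: both policies assign the same log-probability, so the
# loss gradient would raise a likely winner and lower a likely loser.
win_loss = sppo_loss(-12.0, -12.0, p_win=0.9)   # p_win > 0.5: push probability up
lose_loss = sppo_loss(-12.0, -12.0, p_win=0.1)  # p_win < 0.5: push probability down
```

A response whose estimated win probability is exactly 0.5 incurs zero loss, which is what makes repeated self-play iterations (Iter1, Iter2, Iter3) converge toward responses that are not beaten by the policy's own samples.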
Key Characteristics
- Base Model: Fine-tuned from google/gemma-2-9b-it.
- Alignment Method: Self-Play Preference Optimization (SPPO), as described in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
- Training Data: Synthetic responses generated from prompt sets in the openbmb/UltraFeedback dataset, split into three parts for iterative training.
- Language: Primarily English.
- License: Apache-2.0.
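The checkpoint can be used through the standard Hugging Face transformers API. The sketch below is a hypothetical usage example, not official instructions from the model card: the dtype and device settings are assumptions, and the call that actually loads the model is left commented out because it downloads the full 9B-parameter weights.

```python
MODEL_ID = "UCLA-AGI/Gemma-2-9B-It-SPPO-Iter1"

def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the SPPO checkpoint and answer a single chat prompt.

    Imports are local so that merely defining this sketch does not
    require transformers/torch to be installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: Gemma-2 is commonly run in bf16
        device_map="auto",
    )
    # Gemma-2-It models expect their chat template, not a raw string prompt.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# generate_reply("Explain SPPO in one sentence.")  # downloads ~9B weights
```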
Performance Insights
While this is the first iteration, the subsequent iterations (Iter2 and Iter3) of the SPPO process show progressive improvements in length-controlled (LC) win rate and raw win rate on the AlpacaEval leaderboard, suggesting the effectiveness of the SPPO method. For instance, the Iter3 model achieves a 53.27% LC win rate and a 47.74% win rate on AlpacaEval, indicating the potential for further improvement in this model series.