UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3 is a 9-billion-parameter causal language model developed by UCLA-AGI on top of Gemma-2-9B-It. It is the third iteration fine-tuned with Self-Play Preference Optimization (SPPO) on synthetic, primarily English datasets, and is optimized for improved alignment and response quality, achieving higher win rates on the AlpacaEval Leaderboard than earlier iterations.
Overview
UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3 is a 9 billion parameter instruction-tuned language model, building upon the google/gemma-2-9b-it architecture. Developed by UCLA-AGI, this model represents the third iteration of fine-tuning using Self-Play Preference Optimization (SPPO), a method detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675). The training utilized synthetic datasets derived from the openbmb/UltraFeedback dataset.
Key Capabilities & Features
- Self-Play Preference Optimization: Leverages an advanced alignment technique to refine model responses.
- Iterative Improvement: This is the third iteration, showing progressive enhancements in alignment and performance.
- Synthetic Data Training: Fine-tuned exclusively on synthetic prompts and responses (derived from UltraFeedback), so its response style may differ from models trained on human-written data.
- Gemma-2-9B-It Base: Benefits from the foundational capabilities of the Gemma-2-9B-It model.
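For intuition, the SPPO objective described in the paper (arXiv:2405.00675) can be sketched as a squared loss that pushes the policy log-probability ratio toward a scaled, centered win probability. The sketch below is a minimal toy illustration with scalar log-probabilities, not the actual training code used for this model; the function name and the toy numbers are assumptions for illustration only.

```python
def sppo_loss(logp_new, logp_cur, win_prob, eta=1e3):
    """Toy per-response SPPO loss: squared distance between the policy
    log-ratio and eta * (win probability - 1/2).

    logp_new: log-prob of the response under the policy being trained
    logp_cur: log-prob under the current (frozen) policy pi_t
    win_prob: estimated probability the response beats pi_t on average
    eta: scaling hyperparameter (value here is illustrative)
    """
    target = eta * (win_prob - 0.5)
    log_ratio = logp_new - logp_cur
    return (log_ratio - target) ** 2

# A response judged better than average (win_prob > 0.5) incurs loss
# until the new policy assigns it proportionally higher probability.
loss_unmoved = sppo_loss(logp_new=-10.0, logp_cur=-10.0, win_prob=0.8, eta=10.0)
loss_updated = sppo_loss(logp_new=-7.0, logp_cur=-10.0, win_prob=0.8, eta=10.0)
# loss_unmoved is ~9 with these toy numbers; loss_updated is ~0
```

Each SPPO iteration fits this objective against the previous iteration's policy, which is why the model cards are released as Iter1, Iter2, and Iter3.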
Performance Highlights
Evaluations on the AlpacaEval Leaderboard demonstrate a consistent improvement across SPPO iterations:
- Iter3 Win Rate: Achieves a 53.27% length-controlled (LC) win rate and a 47.74% raw win rate, outperforming Iter1 and Iter2.
Use Cases
This model is suitable for applications requiring a 9B parameter instruction-tuned model with enhanced alignment, particularly where response quality and adherence to preferences are critical. Its iterative SPPO training suggests it may excel in generating more preferred and aligned outputs compared to its base model or earlier iterations.
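As a sketch of how one might prompt the model: Gemma-2 instruction-tuned checkpoints use a turn-based chat format, which transformers' `tokenizer.apply_chat_template` produces automatically. The helper below reproduces my understanding of that template for a single user turn; treat the exact control tokens as an assumption and prefer `apply_chat_template` in practice. The commented `pipeline` call is likewise a sketch and assumes a recent transformers release plus sufficient GPU memory.

```python
def build_gemma_prompt(user_message: str) -> str:
    """Format a single-turn prompt in the Gemma-2 chat style.

    NOTE: the control tokens below are an assumption; in practice use
    tokenizer.apply_chat_template(..., add_generation_prompt=True).
    """
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = build_gemma_prompt("Summarize SPPO in one sentence.")

# With transformers installed and enough GPU memory, generation would
# look roughly like this (not executed here):
#
# from transformers import pipeline
# pipe = pipeline("text-generation",
#                 model="UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3",
#                 device_map="auto")
# out = pipe([{"role": "user", "content": "Summarize SPPO."}],
#            max_new_tokens=128)
```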