Model Overview
UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2 is an 8-billion-parameter instruction-tuned model developed by UCLA-AGI. It is built on meta-llama/Meta-Llama-3-8B-Instruct and represents the second iteration of fine-tuning with Self-Play Preference Optimization (SPPO). Alignment training uses prompts drawn from the openbmb/UltraFeedback dataset, with all responses generated synthetically by the model itself.
Key Capabilities & Performance
- Self-Play Preference Optimization: Uses an iterative self-play approach to alignment, in which the model generates its own responses and is optimized against estimated pairwise preferences (see the sketch after this list).
- Improved Alignment: Achieves a higher win rate on the AlpacaEval 2.0 leaderboard than its predecessor: 35.98% versus Iter1's 31.74%.
- General Language Tasks: Shows competitive performance on the Open LLM Leaderboard, with an average score of 69.91 across benchmarks such as MMLU, HellaSwag, and GSM8K.
- Synthetic Data Training: Fine-tuned exclusively on synthetic, model-generated responses, which may shape the style and characteristics of its outputs relative to models trained on human-written data.
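To make the self-play objective concrete, below is a minimal sketch of the per-example SPPO loss as described in the SPPO paper: the log-probability ratio between the current policy and the previous iteration's frozen policy is regressed toward a scaled, centered preference estimate. The function name `sppo_loss`, the tensor layout, and the default `eta` here are illustrative assumptions, not the authors' released training code.

```python
import torch

def sppo_loss(logp_theta: torch.Tensor,
              logp_prev: torch.Tensor,
              pref_est: torch.Tensor,
              eta: float = 1e3) -> torch.Tensor:
    """Illustrative per-example SPPO objective (not the official code).

    logp_theta: log pi_theta(y|x) under the policy being trained
    logp_prev:  log pi_t(y|x) under the previous iteration's frozen policy
    pref_est:   estimated probability that response y beats the previous
                policy's responses, e.g. from a pairwise preference model
    eta:        scaling hyperparameter
    """
    # Regress the log-ratio toward eta * (preference - 1/2):
    # responses preferred over the old policy are pushed up, others down.
    target = eta * (pref_est - 0.5)
    log_ratio = logp_theta - logp_prev
    return ((log_ratio - target) ** 2).mean()
```

Iterating this procedure, with each round's policy serving as the next round's frozen reference, is what distinguishes Iter2 from Iter1.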
When to Use This Model
This model is suitable for applications that require a Llama-3-8B-Instruct base with the stronger alignment provided by SPPO. It is particularly useful where improved instruction following and preference-aligned responses are critical, or when benchmarking against earlier SPPO iterations or the base Llama-3-8B-Instruct model. A minimal loading and generation example follows.
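As a starting point, the checkpoint can be loaded with the Hugging Face transformers library like any Llama-3-Instruct model. The snippet below is a minimal sketch; the prompt and generation parameters are illustrative, not tuned recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama-3-Instruct checkpoints ship a chat template; use it to format prompts.
messages = [
    {"role": "user",
     "content": "Explain self-play preference optimization in two sentences."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```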