UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3

Text generation · Model size: 8B · Quantization: FP8 · Context length: 8K · Published: Jun 25, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3 is an 8 billion parameter instruction-tuned language model developed by UCLA-AGI, based on Meta-Llama-3-8B-Instruct. It was fine-tuned using Self-Play Preference Optimization (SPPO) over three iterations, utilizing synthetic datasets derived from UltraFeedback prompts. This model demonstrates improved alignment and performance on benchmarks like AlpacaEval and Open LLM Leaderboard compared to its previous iterations, making it suitable for general instruction-following tasks.


Overview

UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3 is an 8 billion parameter instruction-tuned model developed by UCLA-AGI. It is built upon the meta-llama/Meta-Llama-3-8B-Instruct architecture and has undergone three iterations of fine-tuning using Self-Play Preference Optimization (SPPO). The training utilized synthetic responses generated from the openbmb/UltraFeedback dataset, specifically split into three parts for iterative refinement.
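As a fine-tune of Meta-Llama-3-8B-Instruct, the model should follow the standard Llama 3 instruct chat format. The sketch below builds such a prompt by hand and shows a hedged local-inference path with `transformers`; the template string is an assumption based on the Llama 3 convention (in practice, `tokenizer.apply_chat_template` produces it for you), and the generation settings are illustrative.

```python
# Sketch: single-turn prompt in the Llama 3 instruct chat format (assumed to
# be inherited unchanged by this SPPO fine-tune), plus a guarded inference path.

MODEL_ID = "UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3"

def format_llama3_prompt(system: str, user: str) -> str:
    """Build a single-turn prompt following the Llama 3 instruct template."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

def run_inference(user_message: str) -> str:
    """Load the model and generate a reply. Needs a GPU with ~16 GB for bf16."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    prompt = format_llama3_prompt("You are a helpful assistant.", user_message)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

Call `run_inference("...")` on a machine with sufficient GPU memory; the prompt formatter itself is pure string construction and runs anywhere.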

Key Capabilities & Performance

This model shows progressive improvements across its iterations. On the AlpacaEval Leaderboard, Iter3 achieves a length-controlled (LC) win rate of 38.77% and a raw win rate of 39.85%, outperforming Iter1 and Iter2. Similarly, on the Open LLM Leaderboard, Iter3 records an average score of 70.29%, with notable scores on arc_challenge (65.19%) and hellaswag (80.86%). The model is primarily English-language focused and is licensed under Apache-2.0.

Training Methodology

The SPPO method aims to enhance language model alignment. Training used a learning rate of 5e-07, the RMSProp optimizer, and a linear learning-rate scheduler with a warmup ratio of 0.1. The iterative training process, as detailed in the associated research paper, leverages self-play to refine model preferences.
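The scheduler described above can be sketched as a small pure function: linear warmup over the first 10% of steps to the peak rate of 5e-07, then linear decay to zero. The total step count here is illustrative, not taken from the paper.

```python
# Minimal sketch of a linear warmup + linear decay learning-rate schedule,
# matching the hyperparameters above (peak lr 5e-07, warmup ratio 0.1).
# The total-step count is a placeholder for illustration.

PEAK_LR = 5e-07
WARMUP_RATIO = 0.1

def lr_at(step: int, total_steps: int) -> float:
    """LR at a given step: ramp up over the warmup window, then decay to zero."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    return PEAK_LR * (total_steps - step) / max(1, total_steps - warmup_steps)

# With 1000 total steps, the peak rate is reached at step 100.
schedule = [lr_at(s, 1000) for s in range(1001)]
```

This mirrors the behavior of the "linear" scheduler type commonly used in Hugging Face training setups.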

Good For

  • General instruction-following tasks
  • Applications requiring improved alignment compared to base Llama-3-8B-Instruct
  • Research into Self-Play Preference Optimization techniques

Popular Sampler Settings

The parameter combinations most commonly used by Featherless users for this model adjust the following sampler settings:

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p
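These settings map directly onto a request body for an OpenAI-compatible chat completions endpoint (which Featherless provides). The sketch below builds such a payload; the sampler values are placeholders for illustration, not the actual top user configurations, and extended fields like top_k, repetition_penalty, and min_p are accepted by many compatible servers but are not part of the core OpenAI schema.

```python
import json

# Placeholder sampler values -- illustrative only, not the real top
# Featherless configurations for this model.
payload = {
    "model": "UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3",
    "messages": [{"role": "user", "content": "Explain SPPO in one sentence."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.1,
    "min_p": 0.05,
}

# Serialize for an HTTP POST to the provider's /v1/chat/completions endpoint.
body = json.dumps(payload)
```

Send `body` with your API key via any HTTP client; only the seven sampler fields above vary between the popular configurations.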