Overview
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1 is a 7-billion-parameter language model developed by UCLA-AGI, representing the first iteration of a model fine-tuned with Self-Play Preference Optimization (SPPO). It is fine-tuned from mistralai/Mistral-7B-Instruct-v0.2 and trained on synthetic responses generated from prompts in the openbmb/UltraFeedback dataset. The methodology is detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
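Since the checkpoint is distributed in the standard Hugging Face format, it can presumably be loaded with the `transformers` library like any other Mistral-7B-Instruct derivative. A minimal sketch (the `[INST]` chat template follows the Mistral-7B-Instruct-v0.2 convention; generation settings are illustrative, not recommendations from the model card):

```python
def format_prompt(user_message: str) -> str:
    # Single-turn chat template in the Mistral-7B-Instruct-v0.2 style.
    return f"<s>[INST] {user_message} [/INST]"


def generate(user_message: str, max_new_tokens: int = 256) -> str:
    # Heavy dependencies are imported lazily so format_prompt stays standalone.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # The template already contains <s>, so skip the automatic BOS token.
    inputs = tokenizer(
        format_prompt(user_message), return_tensors="pt", add_special_tokens=False
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


# Example (requires the model weights and a suitable GPU):
# print(generate("Explain self-play preference optimization in one sentence."))
```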
Key Characteristics
- Self-Play Preference Optimization (SPPO): This model is a direct result of applying the SPPO technique, generating 5 responses per prompt (K=5) in each iteration to refine its alignment.
- Synthetic Data Training: Fine-tuned exclusively on synthetic datasets, demonstrating an alternative approach to preference optimization.
- Iterative Development: This is the first of several iterative models, with subsequent iterations (Iter2, Iter3) showing progressive improvements in evaluation metrics.
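The self-play loop behind these characteristics can be sketched in simplified Python. Everything below is illustrative: `generate_responses` and `preference_score` are stand-ins for the policy model and the PairRM preference ranker, and the per-response win-rate estimate is a simplification of the soft labels the SPPO objective optimizes in the paper.

```python
import random


def generate_responses(prompt: str, k: int = 5) -> list[str]:
    # Stand-in for sampling K candidate responses from the current policy.
    return [f"{prompt} -> candidate {i}" for i in range(k)]


def preference_score(a: str, b: str) -> float:
    # Stand-in for PairRM: estimated probability that response `a` beats `b`.
    rng = random.Random(hash((a, b)))
    return rng.random()


def estimate_win_rates(responses: list[str]) -> list[float]:
    # SPPO scores each response by its average win probability against the
    # other K-1 candidates; these soft labels drive the policy update.
    k = len(responses)
    rates = []
    for i, r in enumerate(responses):
        wins = [preference_score(r, o) for j, o in enumerate(responses) if j != i]
        rates.append(sum(wins) / (k - 1))
    return rates


prompt = "Summarize the benefits of synthetic preference data."
responses = generate_responses(prompt, k=5)
win_rates = estimate_win_rates(responses)
best = responses[win_rates.index(max(win_rates))]
```

In the real pipeline the win rates are not used to pick a single winner but to weight a regression-style loss on the policy's log-probabilities; this sketch only shows where the K=5 candidates and pairwise preference scores come from.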
Evaluation Highlights
The model's performance is benchmarked on several standard evaluations:
- AlpacaEval: Achieved a Win Rate of 23.51% and a length-controlled (LC) Win Rate of 24.79% for Iteration 1.
- Open LLM Leaderboard: Scores include 65.02 on arc_challenge, 69.4 on truthfulqa_mc2, and an average of 66.67.
- MT-Bench: Recorded an average score of 7.21.
Use Cases
This model is particularly relevant for researchers and developers interested in:
- Exploring advanced alignment techniques like Self-Play Preference Optimization.
- Understanding the impact of synthetic data in fine-tuning large language models.
- Benchmarking and comparing iterative improvements in preference-optimized models.