UCLA-AGI/Mistral7B-PairRM-SPPO

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: May 4, 2024 · License: apache-2.0 · Architecture: Transformer

UCLA-AGI/Mistral7B-PairRM-SPPO is a 7-billion-parameter GPT-style language model developed by UCLA-AGI and fine-tuned from Mistral-7B-Instruct-v0.2. It was trained with Self-Play Preference Optimization (SPPO) on synthetic preference data derived from UltraFeedback, with a focus on alignment. By estimating soft win probabilities from three samples during training, it improves performance on AlpacaEval 2.0, making it well suited to tasks that require refined conversational alignment.


Model Overview

UCLA-AGI/Mistral7B-PairRM-SPPO is a 7 billion parameter language model developed by UCLA-AGI, building upon the mistralai/Mistral-7B-Instruct-v0.2 architecture. Its core innovation lies in its training methodology: Self-Play Preference Optimization (SPPO), as detailed in the paper "Self-Play Preference Optimization for Language Model Alignment" (arXiv:2405.00675).
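Because the model inherits the chat format of its Mistral-7B-Instruct-v0.2 base, prompts follow the `[INST] ... [/INST]` convention. The sketch below assembles such a prompt by hand as an illustration; in practice the tokenizer's `apply_chat_template` method is the authoritative way to do this, and the helper name here is our own:

```python
def build_mistral_prompt(messages):
    """Assemble a Mistral-Instruct-style prompt from a list of
    {"role": ..., "content": ...} messages.

    Minimal sketch of the [INST] format used by
    Mistral-7B-Instruct-v0.2 and its derivatives; the tokenizer's
    apply_chat_template handles this (and edge cases) in practice.
    """
    prompt = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            prompt += f"[INST] {msg['content']} [/INST]"
        elif msg["role"] == "assistant":
            # Assistant turns are closed with the end-of-sequence token.
            prompt += f" {msg['content']}</s>"
    return prompt


print(build_mistral_prompt([{"role": "user", "content": "Hello!"}]))
```

A multi-turn conversation is encoded the same way, alternating user and assistant messages before the final open `[/INST]`.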

Key Capabilities & Training

  • Self-Play Preference Optimization (SPPO): The model was fine-tuned using an iterative SPPO approach, leveraging synthetic responses from the openbmb/UltraFeedback dataset, processed via snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset.
  • Enhanced Probability Estimation: During training, it uses three samples (winner, loser, and a random sample) to estimate soft probabilities, which has been shown to deliver better performance on AlpacaEval 2.0 than the method described in the original paper.
  • Synthetic Data Training: All responses used for training are synthetic, indicating a focus on learning from generated, high-quality preference data.
  • Performance: On the AlpacaEval Leaderboard, this model achieves a 32.14% Win Rate and a 30.46% length-controlled (LC) Win Rate.
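The soft-probability step above can be illustrated with a small sketch. Given a raw preference score for each of the sampled responses (e.g. from a ranking model such as PairRM), one way to turn scores into soft win probabilities is to average the sigmoid of pairwise score differences. This is a hypothetical illustration of the idea, not the authors' released training code:

```python
import math


def soft_win_probabilities(scores):
    """Estimate each response's soft probability of being preferred.

    `scores` holds raw preference scores for K sampled responses
    (e.g. winner, loser, and a random sample, so K = 3). Each
    response's soft probability is the mean sigmoid of its score
    difference against every other response. Hypothetical sketch,
    not the authors' exact implementation.
    """
    k = len(scores)
    probs = []
    for i in range(k):
        wins = [
            1.0 / (1.0 + math.exp(scores[j] - scores[i]))
            for j in range(k)
            if j != i
        ]
        probs.append(sum(wins) / (k - 1))
    return probs


# With equal scores, every response is equally likely to win.
print(soft_win_probabilities([0.0, 0.0, 0.0]))
```

Using three samples rather than only the winner/loser pair gives the estimator an extra reference point, which is the refinement the bullet above credits for the AlpacaEval 2.0 gains.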

Important Note

The authors recommend using the original checkpoint, UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3, as reported in their paper; they expect that version to perform more consistently across evaluation tasks.