UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2

Hugging Face
Text generation · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: May 4, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2 is a 7-billion-parameter decoder-only transformer developed by UCLA-AGI and fine-tuned from Mistral-7B-Instruct-v0.2. It is the second iteration of Self-Play Preference Optimization (SPPO), aligned on synthetic preference data derived from UltraFeedback. The model targets improved alignment and response quality, as reflected in its results on benchmarks such as AlpacaEval and MT-Bench.


Overview

UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2 is a 7 billion parameter language model developed by UCLA-AGI, building upon the mistralai/Mistral-7B-Instruct-v0.2 architecture. This model is the second iteration in a series that employs Self-Play Preference Optimization (SPPO) for alignment, as detailed in the paper "Self-Play Preference Optimization for Language Model Alignment." It was fine-tuned using synthetic responses generated from the openbmb/UltraFeedback dataset, specifically a split from snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset.
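The card gives no quickstart, so here is a minimal prompt-formatting sketch, assuming the model inherits the `[INST]` chat template of its base model, Mistral-7B-Instruct-v0.2 (in practice, prefer the tokenizer's own `apply_chat_template`):

```python
def format_mistral_prompt(messages):
    """Render a chat into the Mistral-Instruct [INST] template.

    Assumes the v0.2-style format: <s>[INST] user [/INST]assistant</s>...
    """
    parts = ["<s>"]
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"[INST] {msg['content']} [/INST]")
        elif msg["role"] == "assistant":
            parts.append(f"{msg['content']}</s>")
    return "".join(parts)

# With Hugging Face transformers, the same prompt comes from the tokenizer
# itself (generation settings below are illustrative, not from the card):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2")
#   model = AutoModelForCausalLM.from_pretrained(
#       "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2", device_map="auto")
#   prompt = tok.apply_chat_template(messages, tokenize=False,
#                                    add_generation_prompt=True)
```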

Key Capabilities & Differentiators

  • Self-Play Preference Optimization (SPPO): Uses an iterative self-play mechanism to improve alignment and response quality, sampling five responses per prompt (K=5) in each iteration and ranking them with the PairRM preference model.
  • Synthetic Data Training: Aligned exclusively on synthetic datasets, demonstrating the effectiveness of this approach for preference optimization.
  • Improved Alignment: Shows progressive improvements in alignment metrics across iterations, with Iteration 2 achieving a 27.62% Win Rate on AlpacaEval and an average MT-Bench score of 7.49.
  • Mistral-7B Base: Benefits from the strong foundational capabilities of the Mistral-7B-Instruct-v0.2 model.
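The self-play mechanism above can be made concrete with a small sketch of the per-response SPPO squared objective from the cited paper; the win-probability estimator and `eta` handling here are illustrative assumptions, not the authors' training code:

```python
def estimate_win_prob(pairwise_prefs):
    """Estimate P(y beats the current policy | x) by averaging y's pairwise
    preference scores against the other sampled responses (K=5 responses
    per prompt in this model; PairRM would supply the scores)."""
    return sum(pairwise_prefs) / len(pairwise_prefs)

def sppo_loss(logp, logp_ref, win_prob, eta):
    """SPPO squared loss for one response:
    (log pi(y|x) - log pi_ref(y|x) - eta * (win_prob - 1/2)) ** 2,
    which pushes the policy log-ratio toward eta * (win_prob - 1/2)."""
    return (logp - logp_ref - eta * (win_prob - 0.5)) ** 2
```

A response estimated to beat the current policy more than half the time has its probability pushed up; one estimated below one half is pushed down, with no paired "chosen/rejected" construction needed.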

Evaluation Highlights

  • AlpacaEval: Achieved a 27.62% Win Rate (32.12% with best-of-16 sampling) on AlpacaEval, indicating strong performance in instruction following and helpfulness.
  • MT-Bench: Scored an average of 7.49, reflecting good conversational abilities.
  • Open LLM Leaderboard: Maintained competitive performance across various academic benchmarks, with an average score of 66.75.
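The best-of-16 figure above means sampling sixteen candidate responses and keeping the one a reward or preference model scores highest. A minimal sketch, where the scoring function is a stand-in for a real scorer such as PairRM:

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn
    (e.g. a reward model or a PairRM-derived scalar score)."""
    return max(candidates, key=score_fn)

# Toy usage with a stand-in scorer (longest answer wins):
# best_of_n(["a", "bbb", "cc"], len)  # -> "bbb"
```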

When to Use This Model

This model is well suited to applications that need a 7B-parameter model with enhanced alignment and high-quality instruction-following responses, and to research on alignment driven by synthetic preference data. As the second iteration in the SPPO series, it also serves as a reference point for how alignment quality progresses across iterations.