UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3

Text generation · Model size: 7B · Quantization: FP8 · Context length: 4k · Published: May 4, 2024 · License: apache-2.0 · Architecture: Transformer · Open weights

UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 is a 7-billion-parameter, decoder-only transformer language model developed by UCLA-AGI and fine-tuned from Mistral-7B-Instruct-v0.2. It is the third iteration of Self-Play Preference Optimization (SPPO), trained on synthetic responses to prompts drawn from UltraFeedback. The model is optimized for alignment, and iterative self-play yields improved win rates on preference benchmarks such as AlpacaEval and Arena-Hard.


Model Overview

UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 is a 7-billion-parameter language model developed by UCLA-AGI, built on mistralai/Mistral-7B-Instruct-v0.2. It is the third iteration of fine-tuning with Self-Play Preference Optimization (SPPO): in each iteration, the model generates synthetic responses to prompts from the openbmb/UltraFeedback dataset, and the PairRM preference model (reflected in the model's name) scores them to provide the preference signal for the next round of training. The prompt set was split across three such iterations.
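Because the model descends from Mistral-7B-Instruct-v0.2, prompts follow the standard Mistral instruct format. Below is a minimal sketch: the prompt builder mirrors the `[INST] ... [/INST]` chat template, and the commented-out lines show the standard Hugging Face transformers calls for actual generation (they require a model download and are omitted from the runnable part).

```python
# Minimal sketch of prompting the model. The prompt builder mimics the
# Mistral-instruct chat format inherited from Mistral-7B-Instruct-v0.2.

def format_mistral_prompt(turns):
    """Build a Mistral-instruct prompt from (user, assistant) pairs.

    `assistant` is None for the final, unanswered user turn.
    """
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt


if __name__ == "__main__":
    prompt = format_mistral_prompt([("Hello!", None)])
    print(prompt)

    # To run actual generation (requires GPU and model download):
    # from transformers import AutoModelForCausalLM, AutoTokenizer
    # tok = AutoTokenizer.from_pretrained("UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3")
    # model = AutoModelForCausalLM.from_pretrained("UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3")
    # out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=256)
```

In practice, `AutoTokenizer.apply_chat_template` produces this formatting automatically; the explicit builder is shown only to make the prompt structure visible.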

Key Capabilities and Differentiators

  • Self-Play Preference Optimization: Leverages an iterative self-play mechanism to enhance model alignment and performance, as detailed in the associated research paper.
  • Synthetic Data Training: Entirely trained on synthetic responses, demonstrating the effectiveness of this data generation strategy for alignment.
  • Improved Alignment: Shows progressive improvements in win rates across iterations on preference-based benchmarks like AlpacaEval and Arena-Hard, indicating better alignment with human preferences.
  • Benchmarked Performance: Detailed evaluation results are provided for AlpacaEval, Arena-Hard, Open LLM Leaderboard, and MT-Bench, allowing for quantitative assessment of its capabilities.
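To make the self-play mechanism concrete, here is a schematic of the per-example SPPO objective as described in the accompanying paper: the model is trained to move its log-probability ratio against the previous iteration's policy toward a target proportional to the estimated win probability of a response. The scalar `eta` and the win probability values below are illustrative assumptions, not values from the paper.

```python
# Schematic per-example SPPO loss (squared-error form).
# logp_theta: log-probability of the response under the policy being trained.
# logp_ref:   log-probability under the previous iteration's (frozen) policy.
# win_prob:   estimated probability (e.g. from PairRM) that this response
#             beats the current policy's average response.

def sppo_loss(logp_theta, logp_ref, win_prob, eta=1.0):
    """Push log(pi_theta / pi_ref) toward eta * (win_prob - 1/2)."""
    log_ratio = logp_theta - logp_ref
    target = eta * (win_prob - 0.5)
    return (log_ratio - target) ** 2


# A response at exactly 50% win probability with an unchanged policy
# incurs zero loss; a clearly winning response pulls the ratio upward.
print(sppo_loss(-1.0, -1.0, 0.5))        # no update pressure
print(sppo_loss(-1.0, -1.0, 0.9))        # positive loss until ratio rises
```

Iterating this procedure, with each round's policy becoming the next round's reference, is what produces the "Iter1/Iter2/Iter3" progression in the model family.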

Ideal Use Cases

  • Research in Alignment: Excellent for researchers studying preference optimization, self-play mechanisms, and synthetic data training for LLM alignment.
  • Preference-Based Tasks: Suitable for applications where models need to generate responses that align closely with human preferences, as indicated by its strong performance on relevant benchmarks.
  • Comparative Analysis: Can be used as a baseline or comparison model for evaluating new alignment techniques, especially within the 7B parameter class.
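For comparative analysis, benchmarks like AlpacaEval and Arena-Hard reduce pairwise judgments against a baseline model to a single win rate. A minimal sketch of that aggregation (the outcome encoding below, with ties counted as half a win, is a common convention and an assumption here, not a specification of either benchmark):

```python
# Aggregate pairwise outcomes against a baseline into a win rate.
# Each outcome: 1.0 = win, 0.5 = tie, 0.0 = loss.

def win_rate(outcomes):
    """Mean outcome over all pairwise comparisons, as a fraction in [0, 1]."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    return sum(outcomes) / len(outcomes)


# Example: 3 wins, 1 tie, 1 loss out of 5 comparisons.
print(win_rate([1.0, 1.0, 1.0, 0.5, 0.0]))  # 0.7
```

Tracking this number across SPPO iterations (Iter1 through Iter3) is how the progressive alignment gains reported above are measured.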