UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3

Status: Warm
Visibility: Public
Parameters: 7B
Precision: FP8
Context Length: 4096
Date: May 4, 2024
License: apache-2.0
Source: Hugging Face
Model Overview

UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 is a 7-billion-parameter language model developed by UCLA-AGI, fine-tuned from mistralai/Mistral-7B-Instruct-v0.2. It is the third and final iteration of fine-tuning with Self-Play Preference Optimization (SPPO). Training used synthetic responses generated by the model itself from prompts in the openbmb/UltraFeedback dataset, with the prompt set split across the three iterations.
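
Since this is a standard Mistral-architecture checkpoint, it can be loaded with the transformers library. The snippet below is a minimal inference sketch: the model ID comes from this card, while the prompt and sampling settings are illustrative.

```python
# Minimal inference sketch using Hugging Face transformers.
# The model ID is from this card; sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Mistral-7B-Instruct-v0.2 descendants use the [INST] chat format,
# which apply_chat_template handles automatically.
messages = [
    {"role": "user", "content": "Explain self-play preference optimization in two sentences."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```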

Key Capabilities and Differentiators

  • Self-Play Preference Optimization: Leverages an iterative self-play mechanism to improve model alignment and performance, as detailed in the associated research paper (a conceptual sketch of the objective follows this list).
  • Synthetic Data Training: Fine-tuned entirely on synthetically generated responses, demonstrating the effectiveness of this data-generation strategy for alignment.
  • Improved Alignment: Shows progressive improvements in win rates across iterations on preference-based benchmarks like AlpacaEval and Arena-Hard, indicating better alignment with human preferences.
  • Benchmarked Performance: Detailed evaluation results are provided for AlpacaEval, Arena-Hard, Open LLM Leaderboard, and MT-Bench, allowing for quantitative assessment of its capabilities.
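
As a conceptual illustration of the SPPO objective described in the paper, the sketch below implements its squared loss: the policy's log-ratio against the frozen current policy is regressed toward eta * (P_hat - 1/2), where P_hat is a win probability estimated by a preference model such as PairRM. This is a hypothetical re-implementation under stated assumptions, not the authors' training code; the tensor shapes and the eta hyperparameter are illustrative.

```python
# Conceptual sketch of the SPPO squared-loss objective (not the authors' code).
# Assumptions: log-probabilities are summed per response over a batch of
# sampled responses, and win_prob holds estimated probabilities that each
# response beats the current policy, e.g. from a PairRM-style preference model.
import torch

def sppo_loss(
    logp_theta: torch.Tensor,  # log pi_theta(y|x) under the policy being trained, shape [B]
    logp_ref: torch.Tensor,    # log pi_t(y|x) under the frozen current policy, shape [B]
    win_prob: torch.Tensor,    # estimated P(y beats pi_t | x), shape [B]
    eta: float,                # SPPO step-size hyperparameter (value is tuned in practice)
) -> torch.Tensor:
    # SPPO fits the log-ratio to eta * (win probability - 1/2):
    # L = E[(log(pi_theta / pi_t) - eta * (P_hat - 0.5))^2]
    log_ratio = logp_theta - logp_ref
    target = eta * (win_prob - 0.5)
    return ((log_ratio - target) ** 2).mean()
```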

Ideal Use Cases

  • Research in Alignment: Excellent for researchers studying preference optimization, self-play mechanisms, and synthetic data training for LLM alignment.
  • Preference-Based Tasks: Suitable for applications where generated responses must align closely with human preferences, as indicated by its strong performance on relevant benchmarks (a PairRM ranking sketch follows this list).
  • Comparative Analysis: Can be used as a baseline or comparison model for evaluating new alignment techniques, especially within the 7B parameter class.
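
For preference-based comparison, one possible setup is to rank candidate responses with PairRM, the preference model used during SPPO training. The snippet below follows the usage pattern shown on the llm-blender/PairRM model card; treat it as a sketch, since the llm_blender API details may differ across versions.

```python
# Sketch: ranking candidate responses with PairRM via llm-blender,
# following the usage shown on the llm-blender/PairRM model card.
# The Blender / loadranker / rank calls are assumed from that card
# and may vary between llm-blender versions.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompts = ["Summarize the benefits of self-play preference optimization."]
candidates = [[
    "Response generated by Mistral7B-PairRM-SPPO-Iter3 ...",  # placeholder text
    "Response generated by a baseline 7B model ...",          # placeholder text
]]
# rank() returns, for each prompt, the rank of each candidate (1 = best).
ranks = blender.rank(prompts, candidates, return_scores=False, batch_size=1)
print(ranks)
```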