UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 is a 7-billion-parameter instruction-tuned language model developed by UCLA-AGI and fine-tuned from Mistral-7B-Instruct-v0.2. It is the third iteration of Self-Play Preference Optimization (SPPO), trained on synthetic preference data generated from UltraFeedback prompts, with response pairs ranked by the PairRM preference model. The model is optimized for alignment and demonstrates improved win rates on benchmarks such as AlpacaEval 2.0 and Arena-Hard through iterative self-play.
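A minimal usage sketch is shown below. It assumes the model is published on the Hugging Face Hub under the repository name UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 and that it follows the standard transformers chat-template workflow inherited from Mistral-7B-Instruct-v0.2; the prompt text and generation settings are illustrative only.

```python
# Illustrative sketch: load the model with the standard transformers API.
# Assumes the Hub repo "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3" and a GPU with ~16 GB of memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 7B weights within typical GPU memory
    device_map="auto",
)

# Mistral-Instruct derivatives expect the chat template rather than raw prompting.
messages = [{"role": "user", "content": "Explain self-play preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```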