AALF/gemma-2-27b-it-SimPO-37K-100steps

27B · FP8 · 32768 · Aug 13, 2024 · License: gemma

Model Overview

AALF/gemma-2-27b-it-SimPO-37K-100steps is an instruction-tuned variant of Google's 27-billion-parameter Gemma-2 model. It is a 100-step checkpoint from a larger SimPO fine-tuning run aimed at generating high-quality, human-preferred responses.
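The SimPO objective behind this checkpoint optimizes a length-normalized log-probability margin between chosen and rejected responses. A minimal sketch of that loss follows; the β and γ values here are illustrative, not the hyperparameters of this training run.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    The implicit reward is the average per-token log-probability scaled
    by beta; the loss pushes the chosen-vs-rejected reward margin past
    the target gamma via -log(sigmoid(margin)).
    """
    r_chosen = beta * logp_chosen / len_chosen      # implicit reward, chosen
    r_rejected = beta * logp_rejected / len_rejected  # implicit reward, rejected
    margin = r_chosen - r_rejected - gamma
    return math.log1p(math.exp(-margin))            # -log(sigmoid(margin))

# A pair where the chosen response is far more likely per token yields a
# near-zero loss; flipping the pair yields a large loss.
low = simpo_loss(-20.0, 10, -60.0, 10)
high = simpo_loss(-60.0, 10, -20.0, 10)
```

In practice the per-sequence log-probabilities come from the policy model being trained; this sketch only illustrates the shape of the objective.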

Key Capabilities & Training

  • Preference Optimization: The model was fine-tuned with the SimPO (Simple Preference Optimization) framework, using on-policy preference data generated by the model itself.
  • Data Curation: Training data was generated from the HuggingFaceH4/ultrafeedback_binarized dataset, with responses evaluated by the RLHFlow/ArmoRM-Llama3-8B-v0.1 reward model. Only prompts where the chosen response's reward was significantly higher than the rejected response's reward were selected, resulting in 37,040 training examples.
  • Performance: Achieves a 77.09% WinRate and a 79.16% LC WinRate on the AlpacaEval2.0 benchmark, indicating strong performance in generating preferred outputs.
  • Technical Implementation: Training utilized 8x80G A800 GPUs with deepspeed_zero_stage3 and optimizer offloading to the CPU, leveraging the alignment-handbook library.
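The reward-margin filter described in the data-curation bullet can be sketched as follows. The field names and the margin threshold are illustrative; the card does not state the exact cutoff used to select the 37,040 examples.

```python
# Keep only preference pairs whose chosen-response reward clearly beats
# the rejected-response reward, mirroring the curation step above.
def filter_by_margin(examples, min_margin=0.01):
    """Return pairs where chosen_reward exceeds rejected_reward by min_margin."""
    return [ex for ex in examples
            if ex["chosen_reward"] - ex["rejected_reward"] >= min_margin]

pairs = [
    {"prompt": "p1", "chosen_reward": 0.82, "rejected_reward": 0.40},
    {"prompt": "p2", "chosen_reward": 0.55, "rejected_reward": 0.549},
]
kept = filter_by_margin(pairs)  # only the clear-margin pair survives
```

In the actual pipeline the rewards would come from scoring both responses with the RLHFlow/ArmoRM-Llama3-8B-v0.1 reward model rather than being supplied by hand.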

Good for

  • Applications requiring models optimized for generating high-quality, human-preferred responses.
  • Tasks where robust instruction following and preference alignment are critical.
  • Researchers and developers interested in models fine-tuned with the SimPO method.