Overview
Model Overview
AALF/gemma-2-27b-it-SimPO-37K-100steps is an instruction-tuned variant of Google's 27-billion-parameter Gemma-2 model. It is a checkpoint taken after 100 training steps of a longer SimPO fine-tuning run, optimized to generate high-quality, human-preferred responses.
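As a quick orientation, the snippet below is a minimal usage sketch assuming the standard transformers chat-template workflow for Gemma-2 instruction-tuned models; the prompt and generation settings are illustrative, not the authors' evaluation setup.

```python
# Minimal usage sketch (generation settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AALF/gemma-2-27b-it-SimPO-37K-100steps"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"  # device_map="auto" requires accelerate
)

messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```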
Key Capabilities & Training
- Preference Optimization: The model was fine-tuned with SimPO (Simple Preference Optimization), a reference-free preference optimization method, using on-policy preference data generated by the model itself (see the loss sketch after this list).
- Data Curation: Training data was generated from prompts in the HuggingFaceH4/ultrafeedback_binarized dataset, with responses scored by the RLHFlow/ArmoRM-Llama3-8B-v0.1 reward model. Only prompts where the chosen response's reward was significantly higher than the rejected response's reward were kept, yielding 37,040 training examples (a curation sketch follows this list).
- Performance: Achieves a 77.09% win rate and a 79.16% length-controlled (LC) win rate on the AlpacaEval 2.0 benchmark, indicating strong preference alignment.
- Technical Implementation: Training used 8x 80GB A800 GPUs with `deepspeed_zero_stage3` and optimizer offloading to the CPU, leveraging the `alignment-handbook` library (a config sketch follows this list).
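For readers unfamiliar with SimPO, the sketch below shows its reference-free objective as described in the SimPO paper: a length-normalized log-likelihood margin between chosen and rejected responses, pushed past a target margin gamma. The `beta` and `gamma` defaults here are illustrative, not necessarily the settings used for this checkpoint.

```python
# Sketch of the SimPO objective: length-normalized implicit rewards with a
# target margin gamma, and no reference model. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,    # summed log-probs of chosen responses
               rejected_logps: torch.Tensor,  # summed log-probs of rejected responses
               chosen_lens: torch.Tensor,     # token counts of chosen responses (float)
               rejected_lens: torch.Tensor,   # token counts of rejected responses (float)
               beta: float = 10.0,
               gamma: float = 5.0) -> torch.Tensor:
    # Length-normalized implicit rewards for each response.
    chosen_rewards = beta * chosen_logps / chosen_lens
    rejected_rewards = beta * rejected_logps / rejected_lens
    # Encourage the chosen reward to exceed the rejected reward by at least gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```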
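The reward-gap filtering described in the Data Curation item might look roughly like the sketch below. The candidate sampler, scoring function, and margin threshold are hypothetical placeholders; the card only states that the chosen response's ArmoRM reward had to be significantly higher than the rejected one's.

```python
# Illustrative sketch of on-policy pair construction plus reward-gap filtering.
# `sample` and `score` stand in for the model's sampler and the ArmoRM reward
# model; `margin` is a hypothetical threshold for "significantly higher".
from typing import Callable, Dict, List

def curate_pairs(prompts: List[str],
                 sample: Callable[[str], List[str]],        # on-policy candidate responses
                 score: Callable[[str, str], float],        # reward model: (prompt, response) -> reward
                 margin: float = 0.01) -> List[Dict[str, str]]:
    pairs = []
    for prompt in prompts:  # prompts drawn from HuggingFaceH4/ultrafeedback_binarized
        candidates = sorted(sample(prompt), key=lambda r: score(prompt, r))
        chosen, rejected = candidates[-1], candidates[0]    # best vs. worst candidate
        gap = score(prompt, chosen) - score(prompt, rejected)
        if gap > margin:                                    # keep only wide-margin pairs
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```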
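For reference, a DeepSpeed ZeRO stage-3 configuration with CPU optimizer offloading, expressed as a Python dict, could look like the sketch below. This is an assumed configuration consistent with the setup above, not the authors' exact file; `"auto"` values are resolved by the Hugging Face Trainer.

```python
# Assumed DeepSpeed ZeRO-3 config with CPU optimizer offload, matching the
# hardware setup described above; not the authors' exact configuration.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
# Typically passed as transformers.TrainingArguments(deepspeed=ds_config, ...)
# when training with the alignment-handbook / TRL stack.
```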
Good for
- Applications requiring models optimized for generating high-quality, human-preferred responses.
- Tasks where robust instruction following and preference alignment are critical.
- Researchers and developers interested in models fine-tuned with the SimPO method.