AALF/gemma-2-27b-it-SimPO-37K

27B parameters · FP8 · Context: 32768 · Aug 13, 2024 · License: gemma

Model Overview

AALF/gemma-2-27b-it-SimPO-37K is a large language model fine-tuned from the google/gemma-2-27b-it base model using the SimPO (Simple Preference Optimization) framework, which aligns the model with human preference data via a reference-free reward.
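
Concretely, SimPO's implicit reward is the length-normalized log-likelihood of a response under the policy itself, so no separate reference model is needed. A minimal sketch of the loss in PyTorch follows; the hyperparameter values shown are placeholders, not the settings used for this model.

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=10.0, gamma=5.0):
    """Reference-free SimPO loss over a batch of preference pairs.

    chosen_logps / rejected_logps: summed token log-probabilities of each
    response under the policy; *_lens: response lengths in tokens.
    beta and gamma are illustrative values, not this model's settings.
    """
    # Implicit reward: length-normalized log-likelihood (no reference model).
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Bradley-Terry objective with a target reward margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```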

Training Methodology

The fine-tuning process applied on-policy preference data generation to the HuggingFaceH4/ultrafeedback_binarized dataset: responses were sampled from the model itself and annotated with the RLHFlow/ArmoRM-Llama3-8B-v0.1 reward model, and only prompts where the chosen response received a substantially higher reward than the rejected response were kept, yielding 37,040 training pairs. Training ran on 8×80GB A800 GPUs using DeepSpeed ZeRO Stage 3 (deepspeed_zero_stage3) with optimizer state offloaded to the CPU to keep the 27B model within GPU memory.
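
The reward-margin filtering step can be pictured as below. This is a hypothetical sketch: score() stands in for the RLHFlow/ArmoRM-Llama3-8B-v0.1 reward model, and the margin threshold is an assumed value, not the authors' exact selection criterion.

```python
# Hypothetical margin threshold; the authors' actual criterion is not stated.
MARGIN = 0.1

def filter_pairs(examples, score, margin=MARGIN):
    """Keep only pairs where the chosen response clearly beats the rejected one.

    examples: dicts with "prompt", "chosen", "rejected" fields;
    score: a callable wrapping the reward model (assumed interface).
    """
    kept = []
    for ex in examples:
        r_chosen = score(ex["prompt"], ex["chosen"])
        r_rejected = score(ex["prompt"], ex["rejected"])
        if r_chosen - r_rejected > margin:
            kept.append(ex)
    return kept
```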

Key Characteristics

  • Preference Optimization: Utilizes the SimPO framework for alignment with human preferences, aiming for higher quality and more helpful responses.
  • Data-Driven Refinement: Benefits from a curated dataset of 37,040 high-quality preference examples derived from UltraFeedback.
  • Gemma 2 27B Base: Built upon the robust architecture of Google's Gemma 2 27B instruction-tuned model.

Potential Use Cases

This model is particularly well-suited for applications requiring:

  • High-quality instruction following and conversational AI.
  • Improved response generation in dialogue systems and chatbots.
  • Tasks where alignment with human preferences is critical for user satisfaction.
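
For reference, a minimal way to load and query the model with the Hugging Face transformers library; the prompt and generation settings below are illustrative, not recommendations from the model authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AALF/gemma-2-27b-it-SimPO-37K"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto",
                                             torch_dtype="auto")

# Gemma 2 instruct models use a chat format; apply_chat_template builds it.
messages = [{"role": "user",
             "content": "Explain preference optimization in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```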