princeton-nlp/gemma-2-9b-it-SimPO

Hugging Face
Task: Text Generation · Model Size: 9B · Quantization: FP8 · Context Length: 16k · Concurrency Cost: 1 · Published: Jul 16, 2024 · License: MIT · Architecture: Transformer · Open Weights

The princeton-nlp/gemma-2-9b-it-SimPO model is a 9-billion-parameter causal language model developed by Yu Meng, Mengzhou Xia, and Danqi Chen, fine-tuned from Google's Gemma-2-9B-IT. It uses the SimPO (Simple Preference Optimization) algorithm, an offline preference optimization method that aligns the implicit reward with the model's generation likelihood and does not require a reference model. Trained on a human-preference dataset, the model is well suited to tasks that call for nuanced, preference-aligned response generation.

Model Overview

The princeton-nlp/gemma-2-9b-it-SimPO is a 9-billion-parameter causal language model developed by Yu Meng, Mengzhou Xia, and Danqi Chen. It is fine-tuned from the google/gemma-2-9b-it base model using the SimPO (Simple Preference Optimization) algorithm. SimPO is an offline preference optimization technique that uses the length-normalized log-likelihood of a generated sequence as the implicit reward, so training aligns the reward directly with generation likelihood, eliminates the need for a separate reference model, and incorporates a target reward margin.
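For reference, the SimPO objective can be written as follows. This is a sketch following the SimPO paper rather than text from the model card: π_θ is the policy being trained, (x, y_w, y_l) a prompt with preferred and dispreferred responses, β the reward scaling constant, and γ the target reward margin.

```latex
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
        \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
        \;-\; \gamma
      \right)
    \right]
```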

Key Capabilities & Differentiators

  • Preference Optimization: Leverages the novel SimPO algorithm for effective preference-based fine-tuning.
  • Reference-Free Training: SimPO's design removes the dependency on a reference model, simplifying the training process (a toy sketch of the loss follows this list).
  • Enhanced Performance: Evaluation results show improvements over the base gemma-2-9b-it model and competitive performance against a DPO-tuned variant on various benchmarks, particularly the AlpacaEval 2 length-controlled win rate (AE2 LC) and Arena-Hard (AH).
  • Efficient Fine-tuning: The model was fine-tuned on 8xH100 GPUs in approximately 100 minutes using the princeton-nlp/gemma2-ultrafeedback-armorm dataset.
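
The toy Python sketch below illustrates the reference-free, length-normalized loss for a single preference pair. It is not the authors' training code; the function name and the β and γ values are assumptions for illustration only.

```python
# Toy illustration of the SimPO loss for one preference pair; NOT the authors'
# training code. Function name and beta/gamma values are assumptions.
import torch
import torch.nn.functional as F

def simpo_pair_loss(chosen_token_logps: torch.Tensor,
                    rejected_token_logps: torch.Tensor,
                    beta: float = 10.0,   # assumed reward scaling constant
                    gamma: float = 5.0) -> torch.Tensor:
    # Length-normalized sequence log-likelihoods serve as the implicit,
    # reference-free rewards.
    chosen_reward = beta * chosen_token_logps.mean()
    rejected_reward = beta * rejected_token_logps.mean()
    # Bradley-Terry-style loss with a target reward margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma)

# Fake per-token log-probabilities for responses of different lengths.
chosen = -torch.rand(12)
rejected = -torch.rand(20)
print(simpo_pair_loss(chosen, rejected).item())
```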

Ideal Use Cases

This model is well-suited for applications requiring language generation that is highly aligned with human preferences, especially in scenarios where preference optimization is critical. Its SimPO-based training makes it a strong candidate for tasks demanding nuanced and contextually appropriate responses, offering an alternative to traditional DPO methods.
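
As a minimal usage sketch (not part of the original model card), the checkpoint can be loaded with the Hugging Face transformers library like any other Gemma-2 instruction-tuned model; the dtype, device placement, and generation settings below are illustrative assumptions.

```python
# Minimal inference sketch using Hugging Face transformers (and accelerate for
# device_map). Generation settings here are illustrative, not prescribed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/gemma-2-9b-it-SimPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits a 9B model on one modern GPU
    device_map="auto",
)

# Gemma-2 instruction-tuned checkpoints expect the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```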