li-muyang/zephyr-7b-gemma-dpo

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 21, 2025 · Architecture: Transformer

The li-muyang/zephyr-7b-gemma-dpo is an 8-billion-parameter language model based on the Gemma architecture, fine-tuned using Direct Preference Optimization (DPO). Its training log reports reward metrics for chosen and rejected responses, reflecting its DPO-based alignment. It is intended for tasks that benefit from preference-based fine-tuning, though the model card does not yet define specific use cases.


Model Overview

The li-muyang/zephyr-7b-gemma-dpo is an 8 billion parameter language model built upon the Gemma architecture. It has been fine-tuned using the Direct Preference Optimization (DPO) method, which aligns the model with human preferences by distinguishing between chosen and rejected responses.
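For context, here is a minimal inference sketch using the Hugging Face transformers library. It assumes the checkpoint is hosted on the Hugging Face Hub under this repository id and that your hardware can host an ~8B-parameter model; the dtype and device_map settings are illustrative choices, not part of this model card.

```python
# Minimal inference sketch (assumes `transformers`, `torch`, and `accelerate`
# are installed and the checkpoint is available under this repo id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "li-muyang/zephyr-7b-gemma-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # illustrative; pick a dtype your hardware supports
    device_map="auto",
)

prompt = "Explain Direct Preference Optimization in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```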

Training Details

The model was fine-tuned for 2 epochs with a learning rate of 5e-07 and a total batch size of 128 across 8 GPUs. Key training metrics include a final validation loss of 0.5287, with rewards of 0.4753 for chosen responses and -0.1181 for rejected responses, indicating successful preference learning. The training procedure used the Adam optimizer and a cosine learning-rate scheduler with a 0.1 warmup ratio.
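To make these numbers concrete, the sketch below shows how DPO computes its loss and the chosen/rejected reward terms from policy and reference-model log-probabilities. The function signature and the beta value are illustrative assumptions; this card does not list the beta used in this run.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):  # beta is an assumption; not reported on this card
    """DPO loss for a batch of (chosen, rejected) response pairs.

    Each argument is a tensor of summed per-token log-probabilities,
    one entry per preference pair.
    """
    # The "rewards" are beta-scaled log-ratios of the policy vs. the frozen
    # reference model; these are the quantities logged as rewards/chosen
    # (0.4753 here) and rewards/rejected (-0.1181 here).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # The DPO objective pushes the chosen reward above the rejected reward.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```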

Key Characteristics

  • Architecture: Gemma-based, 8 billion parameters.
  • Fine-tuning: Direct Preference Optimization (DPO) for alignment.
  • Performance Indicators: Achieved a reward accuracy of 0.7708, meaning it ranked the preferred response higher in roughly 77% of evaluation pairs (see the sketch after this list).
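Reward accuracy is simply the fraction of evaluation pairs in which the chosen response's reward exceeds the rejected response's reward. A minimal sketch, reusing the reward tensors returned by the DPO loss function above:

```python
import torch

def reward_accuracy(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> float:
    """Fraction of pairs where the chosen reward beats the rejected reward.

    A value of 0.7708 means the model preferred the chosen response
    in about 77% of evaluation pairs.
    """
    return (chosen_rewards > rejected_rewards).float().mean().item()
```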

Intended Use Cases

While the model card leaves specific intended uses and limitations undefined, models fine-tuned with DPO are generally well suited to tasks that demand a nuanced handling of user preferences, such as instruction following, dialogue generation, and content summarization where output quality and alignment are paramount.
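As a usage illustration for dialogue-style tasks, the sketch below formats a conversation with the tokenizer's chat template before generating. It assumes the checkpoint ships a Gemma/Zephyr-style chat template, which this card does not confirm; the sampling parameters are illustrative.

```python
# Dialogue sketch: assumes the tokenizer ships a chat template,
# which this model card does not explicitly confirm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "li-muyang/zephyr-7b-gemma-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Summarize the trade-offs of DPO versus RLHF in three bullet points."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```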