wzhouad/gemma-2-9b-it-WPO-HB

Text Generation | Concurrency Cost: 1 | Model Size: 9B | Quant: FP8 | Ctx Length: 16k | Published: Aug 8, 2024 | Architecture: Transformer

wzhouad/gemma-2-9b-it-WPO-HB is a 9 billion parameter instruction-tuned Gemma-2 language model developed by wzhouad. It is trained with Weighted Preference Optimization (WPO), a method that improves off-policy preference learning by reweighting the data to simulate on-policy learning. The model is fine-tuned on a hybrid dataset that pairs Ultrafeedback prompts with on-policy sampled Gemma outputs and GPT-4-turbo outputs, and it reaches a 76.73% win rate on AlpacaEval. It is intended for tasks that need strong preference alignment and high response quality.

wzhouad/gemma-2-9b-it-WPO-HB: Enhanced Preference Optimization

wzhouad/gemma-2-9b-it-WPO-HB is a 9 billion parameter instruction-tuned Gemma-2 model trained with Weighted Preference Optimization (WPO). WPO addresses the distributional gap in off-policy preference learning: preference pairs collected from other models do not follow the distribution of the policy being trained. It reweights each preference pair so that the off-policy data more closely resembles on-policy data. This improves optimization without adding computational cost.
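The reweighting idea can be sketched as a DPO-style loss in which each preference pair is scaled by how likely the current policy is to have produced it. This is a minimal illustration under stated assumptions, not the training code for this model: the weight used here (the product of length-normalized sequence probabilities) is an assumed stand-in, since the exact weighting is not given in this description, and the function and field names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pair_weight(pi_chosen: float, pi_rejected: float,
                len_chosen: int, len_rejected: int) -> float:
    """Assumed weight: product of length-normalized sequence probabilities
    of both responses under the current policy. Pairs the policy is more
    likely to generate itself (closer to on-policy) get a larger weight."""
    return math.exp(pi_chosen / len_chosen + pi_rejected / len_rejected)


def weighted_preference_loss(pairs, beta: float = 0.01) -> float:
    """Mean of weight * DPO loss over preference pairs.

    Each pair is a dict of summed sequence log-probabilities under the
    policy (pi_*) and the reference model (ref_*), plus token counts.
    In a real training loop the weight would be treated as a constant
    (no gradient flows through it)."""
    total = 0.0
    for p in pairs:
        # Standard DPO margin between chosen and rejected responses.
        margin = beta * ((p["pi_chosen"] - p["ref_chosen"])
                         - (p["pi_rejected"] - p["ref_rejected"]))
        w = pair_weight(p["pi_chosen"], p["pi_rejected"],
                        p["len_chosen"], p["len_rejected"])
        total -= w * log_sigmoid(margin)
    return total / len(pairs)


# Hypothetical pair: policy and reference agree, so the DPO margin is zero.
example_pair = {
    "pi_chosen": -10.0, "pi_rejected": -20.0,
    "ref_chosen": -10.0, "ref_rejected": -20.0,
    "len_chosen": 10, "len_rejected": 10,
}
example_loss = weighted_preference_loss([example_pair])
```

Because the weight is computed from quantities the loss already needs (the policy log-probabilities), a scheme of this shape adds no extra forward passes, which is consistent with the claim that the reweighting comes at no additional computational cost.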

Key Capabilities & Features

  • Weighted Preference Optimization (WPO): A training strategy that simulates on-policy learning from off-policy preference data, improving alignment and response quality.
  • Hybrid Data Training: Fine-tuned on a combination of on-policy sampled Gemma outputs and GPT-4-turbo outputs, both generated from Ultrafeedback prompts.
  • Performance: Achieves a 76.73% win rate on AlpacaEval, indicating strong instruction following and helpfulness.
  • Efficient Optimization: Improves preference optimization without incurring additional training cost.

Training Details

The model was fine-tuned for 2 epochs with a learning rate of 1e-6, a beta of 0.01, and a batch size of 1 with 16 gradient accumulation steps. The maximum training sequence length was 2048 tokens, with prompts capped at 1800 tokens.
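To make the arithmetic behind these settings explicit, the sketch below collects them in a plain dictionary and derives two quantities: the number of sequences contributing to each optimizer step, and the token budget left for a response when the prompt uses its full allowance. The key names and helper functions are hypothetical, and the device count is not stated here, so it is left as a parameter defaulting to 1.

```python
# Hyperparameters as listed above; key names are illustrative only.
train_config = {
    "learning_rate": 1e-6,
    "beta": 0.01,
    "batch_size": 1,
    "gradient_accumulation_steps": 16,
    "num_epochs": 2,
    "max_length": 2048,
    "max_prompt_length": 1800,
}


def sequences_per_optimizer_step(cfg: dict, num_devices: int = 1) -> int:
    """Sequences accumulated before each weight update.

    num_devices is an assumption; the actual device count is not given."""
    return cfg["batch_size"] * cfg["gradient_accumulation_steps"] * num_devices


def min_completion_budget(cfg: dict) -> int:
    """Tokens left for the response when the prompt is at its maximum length."""
    return cfg["max_length"] - cfg["max_prompt_length"]


steps_batch = sequences_per_optimizer_step(train_config)   # 1 * 16 * 1
completion_tokens = min_completion_budget(train_config)    # 2048 - 1800
```

With a single device this gives 16 sequences per update, and a prompt at the 1800-token cap leaves 248 tokens for the response within the 2048-token limit.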

Use Cases

This model is well suited to applications that need high-quality, preference-aligned responses, especially where reliable instruction following and helpfulness are critical. Because it was trained with WPO, it is also a candidate for tasks that benefit from models aligned on reweighted off-policy preference data.