wzhouad/gemma-2-9b-it-WPO-HB: Enhanced Preference Optimization
wzhouad/gemma-2-9b-it-WPO-HB is a 9-billion-parameter instruction-tuned Gemma-2 model fine-tuned with Weighted Preference Optimization (WPO). WPO addresses the distributional gap in off-policy preference learning by reweighting preference pairs so that off-policy data more closely resembles on-policy data, enhancing the optimization process without additional training cost.
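To make the reweighting idea concrete, here is a minimal, illustrative sketch of a DPO-style loss with a per-pair weight that simulates on-policy sampling. The weight used here (the policy's length-normalized likelihood of both responses) is an assumption for illustration, not the exact formulation from the WPO paper, and the function name is hypothetical.

```python
# Illustrative sketch only: the exact weighting scheme is defined in the WPO paper.
# Here, pairs whose responses look unlikely under the current policy are down-weighted,
# so the off-policy data behaves more like on-policy samples during optimization.
import torch
import torch.nn.functional as F

def wpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   chosen_lengths, rejected_lengths, beta=0.01):
    """DPO-style preference loss with an assumed on-policy simulation weight.

    *_logps: summed log-probabilities of each response under the policy / reference model.
    *_lengths: number of response tokens, used to length-normalize the weight.
    """
    # Standard DPO logits: margin between the chosen and rejected policy/reference log-ratios.
    logits = (policy_chosen_logps - ref_chosen_logps) - \
             (policy_rejected_logps - ref_rejected_logps)
    losses = -F.logsigmoid(beta * logits)

    # Assumed weight: geometric-mean token probability of both responses under the policy.
    # Higher values indicate the pair resembles something the policy would generate itself.
    with torch.no_grad():
        weights = torch.exp(policy_chosen_logps / chosen_lengths) * \
                  torch.exp(policy_rejected_logps / rejected_lengths)

    return (weights * losses).mean()
```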
Key Capabilities & Features
- Weighted Preference Optimization (WPO): A unique training strategy that simulates on-policy learning using off-policy preference data, improving alignment and response quality.
- Hybrid Data Training: Fine-tuned on a combination of on-policy sampled Gemma outputs and GPT-4-turbo outputs, both generated from UltraFeedback prompts.
- Performance: Achieves a 76.73% win rate on AlpacaEval, indicating strong performance in instruction following and helpfulness.
- Efficient Optimization: Enhances preference optimization without incurring additional training costs.
Training Details
The model was fine-tuned for 2 epochs with a learning rate of 1e-06, a beta of 0.01, and a batch size of 1 with 16 gradient accumulation steps. The maximum sequence length was 2048 tokens, with a maximum prompt length of 1800 tokens.
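For reference, the reported hyperparameters map onto a preference-tuning configuration roughly as follows. WPO is not part of stock TRL, so this `DPOConfig` sketch only restates the reported values using the closest analogous fields; field names follow recent TRL releases and the output path is hypothetical.

```python
# Sketch only: restates the reported hyperparameters, not the authors' actual training script.
from trl import DPOConfig

config = DPOConfig(
    output_dir="gemma-2-9b-it-wpo-hb",   # hypothetical output path
    learning_rate=1e-6,
    beta=0.01,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    max_length=2048,         # maximum total sequence length
    max_prompt_length=1800,  # maximum prompt length
)
```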
Use Cases
This model is particularly well-suited for applications requiring high-quality, preference-aligned responses, especially in scenarios where robust instruction following and helpfulness are critical. Its WPO training makes it a strong candidate for tasks that benefit from preference-aligned generation.
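The model can be loaded with standard transformers tooling; the snippet below assumes the chat template is inherited from gemma-2-9b-it and uses an arbitrary example prompt.

```python
# Standard transformers usage for a Gemma-2-based chat model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wzhouad/gemma-2-9b-it-WPO-HB"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain preference optimization in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```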