Name: wzhouad/gemma-2-9b-it-WPO-HB API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: wzhouad

wzhouad/gemma-2-9b-it-WPO-HB: Enhanced Preference Optimization

wzhouad/gemma-2-9b-it-WPO-HB is a 9 billion parameter Gemma-2 instruction-tuned model that introduces a novel Weighted Preference Optimization (WPO) method. WPO addresses the distributional gap in off-policy preference learning by reweighting preference pairs, making off-policy data more closely resemble on-policy data. This approach enhances the optimization process without additional computational costs.

Key Capabilities & Features

Weighted Preference Optimization (WPO): A unique training strategy that simulates on-policy learning using off-policy preference data, improving alignment and response quality.
Hybrid Data Training: Fine-tuned on a combination of on-policy sampled Gemma outputs and GPT-4-turbo outputs, both based on Ultrafeedback prompts.
Performance: Achieves a 76.73% win rate on AlpacaEval, indicating strong performance in instruction following and helpfulness.
Efficient Optimization: Enhances preference optimization without incurring additional training costs.

Training Details

The model was fine-tuned using a learning rate of 1e-06, a beta of 0.01, and a batch size of 1 with 16 gradient accumulation steps over 2 epochs. The maximum sequence length for training was 2048 tokens, with prompts up to 1800 tokens.

Use Cases

This model is particularly well-suited for applications requiring high-quality, preference-aligned responses, especially in scenarios where robust instruction following and helpfulness are critical. Its WPO method makes it a strong candidate for tasks benefiting from advanced preference learning techniques.

Overview

wzhouad/gemma-2-9b-it-WPO-HB: Enhanced Preference Optimization

Key Capabilities & Features

Training Details

Use Cases

Full Model Card (README)