The wzhouad/gemma-2-9b-it-WPO-HB model is a 9-billion-parameter Gemma-2 instruction-tuned language model developed by wzhouad. It uses Weighted Preference Optimization (WPO), a method that improves off-policy preference learning by reweighting preference data according to its probability under the current policy, thereby simulating on-policy training. The model is fine-tuned on a hybrid dataset of UltraFeedback prompts paired with GPT-4-turbo outputs, and achieves strong benchmark results, including a 76.73% win rate on AlpacaEval. It is intended for tasks that require robust preference alignment and improved response quality.
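To make the reweighting idea concrete, here is a minimal, self-contained sketch in plain Python. It is an illustration of the general mechanism only, not the exact formulation from the WPO paper: each preference pair is scaled by a weight derived from the (length-normalized) probability of both responses under the current policy, so off-policy pairs the policy would rarely generate contribute less to a DPO-style loss. All function names, the dictionary keys, and the specific weighting formula are hypothetical.

```python
import math


def wpo_weight(logp_chosen, logp_rejected):
    """Hypothetical WPO-style weight for one preference pair.

    logp_chosen / logp_rejected are per-token log-probabilities of the
    chosen and rejected responses under the current policy. The weight
    is the product of the length-normalized sequence probabilities, so
    pairs the policy is unlikely to generate are down-weighted. This
    illustrates the reweighting idea; the paper's formula may differ.
    """
    avg_c = sum(logp_chosen) / len(logp_chosen)
    avg_r = sum(logp_rejected) / len(logp_rejected)
    return math.exp(avg_c) * math.exp(avg_r)  # in (0, 1]


def weighted_dpo_loss(pairs, beta=0.1):
    """Standard DPO logistic loss with each pair scaled by its weight.

    Each pair is a dict with per-token log-probs under the policy
    ("pi_*") and the frozen reference model ("ref_*").
    """
    total = 0.0
    for p in pairs:
        w = wpo_weight(p["pi_chosen"], p["pi_rejected"])
        # DPO margin: beta * ((log pi_c - log ref_c) - (log pi_r - log ref_r))
        margin = beta * (
            (sum(p["pi_chosen"]) - sum(p["ref_chosen"]))
            - (sum(p["pi_rejected"]) - sum(p["ref_rejected"]))
        )
        # Weighted negative log-sigmoid of the margin.
        total += -w * math.log(1.0 / (1.0 + math.exp(-margin)))
    return total / len(pairs)


# Toy pair: the policy slightly prefers the chosen response already.
pair = {
    "pi_chosen": [-0.5, -0.5],
    "pi_rejected": [-1.0, -1.0],
    "ref_chosen": [-0.6, -0.6],
    "ref_rejected": [-0.9, -0.9],
}
```

In practice the per-token log-probabilities would come from forward passes of the policy and reference models; the sketch keeps everything in pure Python so the weighting and loss arithmetic are easy to follow.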