kaist-ai/mistral-orpo-beta
kaist-ai/mistral-orpo-beta is a 7 billion parameter language model developed by KAIST AI, fine-tuned from Mistral-7B-v0.1 using the Odds Ratio Preference Optimization (ORPO) method. This model directly learns preferences without a supervised fine-tuning warmup, distinguishing it from traditional alignment techniques. It is specifically optimized for conversational AI and instruction following, demonstrating strong performance on benchmarks like MT-Bench and AlpacaEval.
Overview
kaist-ai/mistral-orpo-beta is a 7 billion parameter language model based on Mistral-7B-v0.1, developed by KAIST AI. Its key differentiator is the use of Odds Ratio Preference Optimization (ORPO), a novel alignment technique that allows the model to learn preferences directly, bypassing the need for an initial supervised fine-tuning phase. This approach simplifies the alignment process and aims for more efficient preference learning.
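To make the idea concrete, here is a minimal sketch of the odds-ratio penalty at the heart of ORPO, using hypothetical per-sequence probabilities for a chosen and a rejected response. The real training objective combines this term (weighted by a coefficient) with the standard negative log-likelihood loss and works on token-level log-probabilities; the function names here are illustrative, not from the model's code.

```python
import math

def odds(p):
    # Odds of generating a response with probability p: p / (1 - p).
    return p / (1.0 - p)

def orpo_preference_term(p_chosen, p_rejected):
    # ORPO's relative-ratio term: -log sigmoid(log odds(chosen) - log odds(rejected)).
    # Small when the model prefers the chosen response, large when it prefers the rejected one.
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
```

Minimizing this term pushes the odds of the chosen response above the odds of the rejected one, which is why no separate SFT warmup or reference model is needed: preference pressure is applied directly during fine-tuning.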
Key Capabilities & Performance
- ORPO Alignment: Utilizes the ORPO method, fine-tuned exclusively on 61k instances of the cleaned argilla/ultrafeedback-binarized-preferences-cleaned dataset.
- Strong Conversational Performance: Achieves an MT-Bench score of 7.32, comparable to Zephyr β (7.34), outperforming TULU-2-DPO (7.00) in its size class, and significantly surpassing Llama-2-Chat models.
- High Preference Alignment: Demonstrates strong results on AlpacaEval 2.0 with a score of 12.20, indicating effective alignment with human preferences.
- Instruction Following: Shows competitive performance on IFEval, with scores of 0.5287 (Prompt-Strict) and 0.6355 (Inst-Strict), suggesting good adherence to instructions.
When to Use This Model
- Conversational AI: Ideal for chatbots and dialogue systems where high-quality, aligned responses are crucial.
- Instruction Following: Suitable for tasks requiring the model to accurately follow complex instructions.
- Preference Learning Research: A valuable model for researchers exploring alternative alignment methods like ORPO.
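For the conversational and instruction-following use cases above, the model can be loaded with the Hugging Face `transformers` library. This is a minimal sketch assuming the model follows the standard Mistral chat template published on its model card; it requires a GPU (or ample RAM) and downloads the 7B weights on first use, and the prompt text is only an example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaist-ai/mistral-orpo-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Format a single-turn conversation with the model's chat template.
messages = [{"role": "user", "content": "Explain ORPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Using `apply_chat_template` rather than hand-built prompts keeps the input consistent with the format the model was aligned on, which matters for preference-tuned models like this one.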