kaist-ai/mistral-orpo-beta

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 8k · Published: Mar 12, 2024 · License: MIT · Architecture: Transformer · Open Weights

kaist-ai/mistral-orpo-beta is a 7 billion parameter language model developed by KAIST AI, fine-tuned from Mistral-7B-v0.1 using the Odds Ratio Preference Optimization (ORPO) method. This model directly learns preferences without a supervised fine-tuning warmup, distinguishing it from traditional alignment techniques. It is specifically optimized for conversational AI and instruction following, demonstrating strong performance on benchmarks like MT-Bench and AlpacaEval.


Overview

kaist-ai/mistral-orpo-beta is a 7 billion parameter language model based on Mistral-7B-v0.1, developed by KAIST AI. Its key differentiator is the use of Odds Ratio Preference Optimization (ORPO), a novel alignment technique that allows the model to learn preferences directly, bypassing the need for an initial supervised fine-tuning phase. This approach simplifies the alignment process and aims for more efficient preference learning.
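
To make the idea concrete, below is a minimal sketch of the ORPO objective as described in the original paper: a standard negative log-likelihood term on the chosen response plus a log-odds-ratio penalty that pushes the odds of the chosen response above those of the rejected one. The function name, tensor shapes, and the λ default here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, nll_chosen, lam=0.1):
    """Sketch of the ORPO objective (illustrative, not the authors' code).

    chosen_logps / rejected_logps: length-normalized mean token
        log-probabilities of the chosen / rejected responses, shape (batch,).
    nll_chosen: the usual SFT negative log-likelihood on the chosen response.
    lam: weight on the odds-ratio term (this default is illustrative).
    """
    # log odds(y|x) = log p(y|x) - log(1 - p(y|x)), computed from log-probs
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: reward the model when the chosen response is more
    # likely (in odds space) than the rejected one
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (nll_chosen + lam * l_or).mean()
```

Because the odds-ratio penalty is applied alongside the ordinary language-modeling loss, a single training run both teaches the instruction format and aligns preferences, which is why no separate SFT warmup is needed.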

Key Capabilities & Performance

  • ORPO Alignment: Fine-tuned with the ORPO method exclusively on the 61k-instance argilla/ultrafeedback-binarized-preferences-cleaned dataset (see the loading sketch after this list).
  • Strong Conversational Performance: Achieves an MT-Bench score of 7.32, performing on par with Zephyr β (7.34) and ahead of TULU-2-DPO (7.00) in its size class, and significantly surpassing Llama-2-Chat models.
  • High Preference Alignment: Demonstrates strong alignment with human preferences, scoring 12.20% on AlpacaEval 2.0.
  • Instruction Following: Shows competitive performance on IFEval, with scores of 0.5287 (Prompt-Strict) and 0.6355 (Inst-Strict), suggesting good adherence to instructions.
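
To inspect the training data referenced above, the preference pairs can be pulled from the Hugging Face Hub with the datasets library. This is a quick exploratory sketch; the exact column layout should be verified against the dataset card.

```python
from datasets import load_dataset

# Preference pairs used for ORPO fine-tuning (~61k chosen/rejected examples)
ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")
print(ds)            # inspect size and column names
print(ds[0].keys())  # e.g. prompt / chosen / rejected (verify on the dataset card)
```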

When to Use This Model

  • Conversational AI: Ideal for chatbots and dialogue systems where high-quality, aligned responses are crucial (see the inference sketch after this list).
  • Instruction Following: Suitable for tasks requiring the model to accurately follow complex instructions.
  • Preference Learning Research: A valuable model for researchers exploring alternative alignment methods like ORPO.
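
For the conversational use case above, a minimal inference sketch with the Hugging Face transformers API looks like the following; the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaist-ai/mistral-orpo-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Explain ORPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens
output = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```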