alvarobartt/Mistral-7B-v0.1-ORPO

Text Generation | Concurrency Cost: 1 | Model Size: 7B | Quant: FP8 | Ctx Length: 4k | Published: Mar 21, 2024 | License: apache-2.0 | Architecture: Transformer | Open Weights

alvarobartt/Mistral-7B-v0.1-ORPO is a 7 billion parameter language model fine-tuned by alvarobartt from the Mistral-7B-v0.1 base model using ORPO (Odds Ratio Preference Optimization). ORPO is a single-stage technique that folds preference alignment into supervised fine-tuning, so it needs neither a separate alignment stage nor a reference model. This makes it faster to train and less memory-intensive than multi-stage pipelines that run SFT followed by DPO or PPO. The model is aimed at use cases that benefit from preference-aligned responses and supports a 4096-token context length.


Overview

alvarobartt/Mistral-7B-v0.1-ORPO is a 7 billion parameter language model fine-tuned from the mistralai/Mistral-7B-v0.1 base model. It was trained with the experimental ORPO (Odds Ratio Preference Optimization) method, which merges supervised fine-tuning (SFT) and preference optimization (the role DPO or PPO plays in multi-stage pipelines) into a single training stage. ORPO does this by adding an odds-ratio penalty to the usual SFT loss, which favors chosen responses over rejected ones. Because the penalty is computed from the policy model alone, no separate reference model is needed, which makes fine-tuning faster and more memory-efficient.
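
To make the single-stage idea concrete, here is a minimal sketch of an ORPO-style objective for one preference pair, written in plain Python. It assumes the per-token log-probabilities of the chosen and rejected responses are already available from the model being trained. The function names and the weight `lam=0.1` are illustrative assumptions, not values taken from this model's training configuration.

```python
import math

def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is the mean per-token log-probability of a response,
    so p is a length-normalised sequence likelihood. Requires avg_logp < 0.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio penalty.

    lam is a hypothetical weighting, chosen here only for illustration.
    """
    chosen_avg = sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_avg = sum(rejected_token_logps) / len(rejected_token_logps)

    # Supervised term: negative log-likelihood of the chosen response.
    sft_loss = -chosen_avg

    # Preference term: -log(sigmoid(z)) == log(1 + exp(-z)),
    # where z is the log odds ratio of chosen over rejected.
    z = log_odds(chosen_avg) - log_odds(rejected_avg)
    or_loss = math.log1p(math.exp(-z))

    return sft_loss + lam * or_loss

# Example with made-up log-probabilities: the loss is lower when the
# response labelled "chosen" is the one the model already finds likelier.
good = [-0.1, -0.2]
bad = [-1.0, -2.0]
print(orpo_loss(good, bad) < orpo_loss(bad, good))
```

Both terms depend only on the model being trained, which is why no frozen reference copy has to be kept in memory during training.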

Key Capabilities & Features

  • Single-Stage Preference Optimization: Employs ORPO, a novel method that combines SFT and preference alignment into one training phase, reducing training time and memory footprint.
  • Preference Data Driven: Fine-tuned on the preference dataset alvarobartt/dpo-mix-7k-simplified, in which each record pairs a prompt with a chosen response and a rejected response.
  • Efficient Training: Benefits from ORPO's design, which is noted for being faster to train and requiring less memory compared to multi-stage PPO/DPO methods.
  • Strong Performance: The ORPO method has been reported to give strong results for 7B parameter models such as Mistral, in some benchmarks outperforming larger models. This describes the method in general; no benchmark scores for this specific checkpoint are listed here.
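
The prompt, chosen, and rejected structure mentioned above can be sketched as a small Python type. This is a hypothetical illustration of the record shape only: the class name, the helper, and the use of plain strings are assumptions, and the actual dataset may store these fields differently (for example as lists of chat messages).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One preference record: a prompt plus a preferred and a dispreferred reply."""
    prompt: str
    chosen: str
    rejected: str

def is_usable(pair: PreferencePair) -> bool:
    """Hypothetical sanity check: all fields non-empty and the two replies differ.

    A pair whose chosen and rejected replies are identical carries no
    preference signal, so it is treated as unusable here.
    """
    fields = (pair.prompt, pair.chosen, pair.rejected)
    has_content = all(text.strip() for text in fields)
    return has_content and pair.chosen != pair.rejected

# Made-up example record, not taken from the dataset.
example = PreferencePair(
    prompt="Explain what a context window is.",
    chosen="It is the maximum number of tokens the model can attend to at once.",
    rejected="It is a window.",
)
print(is_usable(example))
```

During ORPO training, the chosen reply supplies the supervised target, while the chosen and rejected replies together supply the preference signal.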

Good For

  • Developers looking for a Mistral-7B variant optimized with a cutting-edge, efficient preference alignment technique.
  • Applications requiring models fine-tuned on preference datasets for improved response quality and alignment.
  • Experimentation with the ORPO fine-tuning paradigm, especially for those interested in single-stage preference optimization.