alvarobartt/Mistral-7B-v0.1-ORPO
alvarobartt/Mistral-7B-v0.1-ORPO is a 7-billion-parameter language model fine-tuned by alvarobartt from the Mistral-7B-v0.1 base model using ORPO (Odds Ratio Preference Optimization). ORPO is a single-stage technique that combines supervised fine-tuning with preference alignment, which makes it faster to train and less memory-intensive than multi-stage DPO/PPO pipelines. The model targets use cases that benefit from preference-aligned responses, is described as performing strongly on various benchmarks, and has a 4096-token context length.
Overview
alvarobartt/Mistral-7B-v0.1-ORPO is a 7-billion-parameter language model fine-tuned from the mistralai/Mistral-7B-v0.1 base model. It uses the experimental ORPO (Odds Ratio Preference Optimization) method, which folds supervised fine-tuning (SFT) and preference alignment (the role DPO or PPO plays in multi-stage pipelines) into a single training stage. Because ORPO does not need a separate reference model, fine-tuning is faster and more memory-efficient.
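To make the single-stage idea concrete, here is a minimal sketch of an ORPO-style objective for one example, written in plain Python. It follows the general form described for the method (an SFT term on the chosen response plus a weighted odds-ratio term) and is not the training code used for this model. The function names, the toy log-probabilities, and the weight `lam=0.1` are assumptions chosen for illustration.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    avg_logp is the length-normalized log-likelihood of a response and
    must be strictly negative, so that p < 1.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    """Illustrative ORPO-style loss for a single (chosen, rejected) pair.

    Both arguments are per-token log-probabilities from the SAME model
    being trained; no reference model appears anywhere.
    lam is a hypothetical weight for the odds-ratio term.
    """
    chosen_avg = sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_avg = sum(rejected_token_logps) / len(rejected_token_logps)

    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg

    # Odds-ratio term: -log(sigmoid(z)) with z the log odds ratio,
    # computed as log(1 + exp(-z)).
    z = log_odds(chosen_avg) - log_odds(rejected_avg)
    odds_ratio_term = math.log1p(math.exp(-z))

    return sft_term + lam * odds_ratio_term


# Toy per-token log-probabilities (made up for this example).
chosen = [-0.1, -0.2]
rejected = [-1.0, -2.0]
loss = orpo_loss(chosen, rejected)
```

Both terms are computed from one forward pass of the model being trained, which is where the memory saving over reference-model-based methods comes from.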
Key Capabilities & Features
- Single-Stage Preference Optimization: Employs ORPO, a novel method that combines SFT and preference alignment into one training phase, reducing training time and memory footprint.
- Preference Data Driven: Fine-tuned using a preference dataset, alvarobartt/dpo-mix-7k-simplified, which consists of prompt, chosen, and rejected response pairs.
- Efficient Training: Benefits from ORPO's design, which is noted for being faster to train and requiring less memory compared to multi-stage PPO/DPO methods.
- Strong Performance: The ORPO method has been reported to give strong results for 7B-parameter models such as Mistral, in some benchmarks outperforming larger models.
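The prompt, chosen, and rejected structure mentioned in the list above can be sketched as a small record type. The three field names come from the dataset description; representing each field as a plain string is an assumption (the actual dataset may store chat-style messages), and `PreferencePair`, `is_usable`, and `to_training_texts` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example: a prompt with a preferred and a dispreferred reply."""
    prompt: str
    chosen: str
    rejected: str


def is_usable(pair: PreferencePair) -> bool:
    """Keep only pairs with non-empty fields and two distinct responses."""
    fields_present = all(
        text.strip() for text in (pair.prompt, pair.chosen, pair.rejected)
    )
    return fields_present and pair.chosen != pair.rejected


def to_training_texts(pair: PreferencePair) -> tuple[str, str]:
    """Return (prompt + chosen, prompt + rejected), the two sequences to score."""
    return pair.prompt + pair.chosen, pair.prompt + pair.rejected


# Made-up example record.
example = PreferencePair(
    prompt="What is 2 + 2? ",
    chosen="2 + 2 equals 4.",
    rejected="2 + 2 equals 5.",
)
chosen_text, rejected_text = to_training_texts(example)
```

Scoring both sequences with the same model is what allows a single training stage to learn from the chosen response while penalizing the rejected one.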
Good For
- Developers looking for a Mistral-7B variant optimized with a cutting-edge, efficient preference alignment technique.
- Applications requiring models fine-tuned on preference datasets for improved response quality and alignment.
- Experimentation with the ORPO fine-tuning paradigm, especially for those interested in single-stage preference optimization.
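For readers who want to try the model, the following is a minimal sketch of loading it for text generation. It assumes the checkpoint is published on the Hugging Face Hub under the identifier used in this card and loads with the standard `transformers` auto classes. The `generate_text` helper and its defaults are hypothetical, and precision and device-placement options are left out for brevity.

```python
MODEL_ID = "alvarobartt/Mistral-7B-v0.1-ORPO"
MAX_CONTEXT_TOKENS = 4096  # context length stated in this card


def generate_text(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a continuation for `prompt` (downloads the weights on first use)."""
    # Imported lazily so that defining this function does not require
    # transformers to be installed and does not trigger any download.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Truncate the prompt so that prompt plus new tokens fit in the context window.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_CONTEXT_TOKENS - max_new_tokens,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The decoded text contains the prompt followed by the generated tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Explain preference optimization in one sentence.")` would download the weights and run generation. A 7B model needs substantial memory, so in practice a reduced-precision or quantized load is usually preferable.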