princeton-nlp/Mistral-7B-Instruct-RDPO

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: May 17, 2024 · Architecture: Transformer

Mistral-7B-Instruct-RDPO is a 7 billion parameter instruction-tuned language model released by princeton-nlp. It is based on the Mistral architecture with a 4096-token context length and is fine-tuned with the Reference-Free DPO (RDPO) method detailed in the SimPO preprint. It is intended for general instruction-following tasks, using preference optimization to improve response quality without requiring a separate reference reward model.


Overview

princeton-nlp/Mistral-7B-Instruct-RDPO is a 7 billion parameter instruction-tuned language model. This model is a direct release from the research presented in the preprint, "SimPO: Simple Preference Optimization with a Reference-Free Reward." It leverages a novel training methodology known as Reference-Free DPO (RDPO), which aims to optimize model preferences without the need for an explicit reference reward model.
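To make the reference-free idea concrete, the sketch below implements a preference loss in the style of the SimPO preprint: the implicit reward is the length-normalized policy log-probability, so no reference-model log-probs are needed (standard DPO would subtract them). This is an illustrative sketch only; the exact RDPO objective used to train this checkpoint may differ, and the `beta`/`gamma` values are placeholders, not the paper's hyperparameters.

```python
import math

def reference_free_pref_loss(logp_chosen, logp_rejected,
                             len_chosen, len_rejected,
                             beta=2.0, gamma=0.5):
    """SimPO-style reference-free preference loss (sketch).

    logp_*: total log-probability the policy assigns to each response.
    len_*:  response lengths in tokens, used for length normalization.
    """
    # Implicit rewards: length-normalized policy log-probabilities,
    # with no reference-model term to subtract.
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # Logistic (Bradley-Terry) loss with a target reward margin gamma.
    margin = r_chosen - r_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy check: the loss shrinks as the chosen response becomes more
# likely than the rejected one under the policy.
low = reference_free_pref_loss(-10.0, -40.0, 10, 10)
high = reference_free_pref_loss(-40.0, -10.0, 10, 10)
```

In training, `logp_chosen` and `logp_rejected` would come from summing the policy's per-token log-probs over each response; the length normalization is what lets the objective stay reference-free without simply rewarding longer outputs.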

Key Capabilities

  • Instruction Following: Designed to accurately follow a wide range of user instructions.
  • Preference Optimization: Utilizes the RDPO method for fine-tuning, offering a distinct approach to aligning model outputs with desired preferences.
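
Since the model follows the standard Mistral-Instruct chat format, prompts are wrapped in `[INST] ... [/INST]` markers. The helper below is a minimal sketch of that formatting; in practice, prefer `tokenizer.apply_chat_template` from the `transformers` library, which applies the model's own template and special tokens.

```python
def build_mistral_prompt(instruction, system=None):
    """Wrap a user instruction in the Mistral-Instruct [INST] format.

    Sketch only: production code should use
    AutoTokenizer.from_pretrained(...).apply_chat_template(...)
    so the exact template shipped with the checkpoint is used.
    """
    content = f"{system}\n\n{instruction}" if system else instruction
    return f"<s>[INST] {content} [/INST]"

prompt = build_mistral_prompt("Summarize the SimPO preprint in one sentence.")
```

The resulting string can be tokenized and passed to `model.generate` with `AutoModelForCausalLM.from_pretrained("princeton-nlp/Mistral-7B-Instruct-RDPO")`.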

Good for

  • Researchers and developers interested in exploring advanced preference optimization techniques, particularly those involving reference-free methods.
  • Applications requiring a 7B instruction-tuned model with a unique training paradigm for improved response quality.

For more in-depth technical details and implementation, refer to the SimPO repository and the associated preprint.