princeton-nlp/Mistral-7B-Base-SFT-RRHF

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 4k · Published: Jul 6, 2024 · Architecture: Transformer

princeton-nlp/Mistral-7B-Base-SFT-RRHF is a 7 billion parameter language model from princeton-nlp, fine-tuned with RRHF (Rank Responses to align with Human Feedback). Built on the Mistral-7B-Base architecture with a 4096-token context window, it is designed for improved alignment with human preferences and suited to tasks requiring nuanced response generation and preference optimization.


Model Overview

This model, princeton-nlp/Mistral-7B-Base-SFT-RRHF, is a 7 billion parameter language model developed by princeton-nlp. It is a fine-tuned version of the Mistral-7B-Base architecture, optimized using the RRHF (Rank Responses to align with Human Feedback) method. Rather than training on a single reference answer per prompt, RRHF trains the model to rank candidate responses in agreement with human-derived reward signals, improving alignment with human preferences.

Key Characteristics

  • Architecture: Based on the efficient Mistral-7B-Base model.
  • Parameter Count: 7 billion parameters, offering a balance between performance and computational efficiency.
  • Context Length: Supports a context window of 4096 tokens.
  • Training Method: Utilizes RRHF (Rank Responses to align with Human Feedback) for preference optimization, as detailed in the associated research paper.
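The characteristics above translate directly into inference settings. A minimal sketch of loading the model (assuming the Hugging Face transformers library; the heavy download is kept behind a main guard) and clamping generation so the prompt plus output stays within the 4096-token context window:

```python
# Sketch: load princeton-nlp/Mistral-7B-Base-SFT-RRHF and budget generation
# against the 4096-token context window listed on this card.

CTX_LEN = 4096  # context window from the model card


def generation_budget(prompt_tokens: int, max_new_tokens: int, ctx_len: int = CTX_LEN) -> int:
    """Clamp max_new_tokens so prompt + generation fits the context window."""
    return max(0, min(max_new_tokens, ctx_len - prompt_tokens))


if __name__ == "__main__":
    # Downloads ~7B parameters; requires transformers + torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "princeton-nlp/Mistral-7B-Base-SFT-RRHF"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    prompt = "Explain preference optimization in one paragraph."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    n_prompt = inputs["input_ids"].shape[-1]
    out = model.generate(**inputs, max_new_tokens=generation_budget(n_prompt, 256))
    print(tokenizer.decode(out[0][n_prompt:], skip_special_tokens=True))
```

The `generation_budget` helper is an illustrative convenience, not part of the model's release; exact sampling parameters should follow your application's needs.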

Primary Differentiator

What sets this model apart is its specific fine-tuning with the RRHF method, which is designed to improve alignment with human preferences. This makes it particularly effective for applications where the quality and human-likeness of generated responses are critical, moving beyond standard supervised fine-tuning.
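At a high level, RRHF compares several candidate responses per prompt and penalizes the policy whenever it scores a lower-rewarded candidate above a higher-rewarded one. A toy sketch of that pairwise ranking term follows; this is an illustration of the idea, not the authors' implementation (the full RRHF objective also adds a supervised fine-tuning term on the best-rewarded response):

```python
def rrhf_rank_loss(model_scores, rewards):
    """Toy RRHF pairwise ranking loss.

    model_scores: length-normalized log-probabilities the policy assigns
                  to each candidate response.
    rewards:      reward scores (e.g. from a reward model or human labels)
                  for the same candidates.

    For every pair where candidate j is rewarded higher than candidate i,
    add a penalty if the policy still scores i above j.
    """
    loss = 0.0
    n = len(model_scores)
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                loss += max(0.0, model_scores[i] - model_scores[j])
    return loss
```

When the policy's ranking already matches the reward ranking, the loss is zero; misordered pairs contribute their score gap.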

Potential Use Cases

  • Preference-aligned generation: Ideal for tasks where outputs need to closely match human judgments or preferences.
  • Dialogue systems: Can be used to generate more natural and preferred responses in conversational AI.
  • Content creation: Suitable for generating text that is more likely to be rated highly by human evaluators.
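One common pattern for preference-aligned generation (an illustration, not something shipped with this model) is best-of-n reranking: sample several candidates, then keep the one with the highest length-normalized log-probability, the same scoring rule RRHF trains on:

```python
def length_normalized_logprob(token_logprobs):
    """Average per-token log-probability of a candidate response."""
    return sum(token_logprobs) / max(1, len(token_logprobs))


def pick_best(candidates):
    """candidates: list of (text, token_logprobs) pairs.

    Returns the text whose length-normalized score is highest,
    so long responses are not unfairly penalized for having more tokens.
    """
    return max(candidates, key=lambda c: length_normalized_logprob(c[1]))[0]
```

The token log-probabilities here would come from the model's own scoring pass; the helper names are hypothetical.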

For more technical details, refer to the associated preprint, SimPO: Simple Preference Optimization with a Reference-Free Reward, and its accompanying repository.