TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-8B-R

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 8K · Published: May 17, 2024 · Architecture: Transformer

SFR-Iterative-DPO-LLaMA-3-8B-R is an 8 billion parameter instruct model developed by Salesforce, based on the LLaMA-3 architecture with an 8192 token context length. It utilizes an iterative DPO-based online RLHF training method, enabling it to outperform models of similar size and many larger open-source and proprietary models on instruct benchmarks like Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. This model is optimized for instruction following and general conversational AI tasks, achieving strong performance without relying on additional human or GPT-4 labeling.


Model Overview

SFR-Iterative-DPO-LLaMA-3-8B-R is an 8 billion parameter instruct model from Salesforce, built upon the LLaMA-3 architecture. It distinguishes itself through an innovative iterative DPO (Direct Preference Optimization) based online RLHF (Reinforcement Learning from Human Feedback) training approach. This method is designed to be more efficient and simpler than PPO-based alternatives, while effectively mitigating distribution shifts during policy optimization.
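At the core of this training approach is the DPO objective, which pushes the policy to assign a higher likelihood margin (relative to a frozen reference model) to the preferred response in each pair. The sketch below illustrates the standard DPO loss for a single preference pair; the `beta` default is a common choice in the literature, not necessarily the value used for this model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w, logp_l         : policy log-probs of the chosen / rejected response
    ref_logp_w, ref_logp_l : reference-model log-probs of the same responses
    beta                   : KL-penalty strength (0.1 is a common default)
    """
    # Margin of the policy's preference over the reference model's
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy and reference agree exactly, the loss is log(2) ~ 0.693
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

In the iterative, online variant, this loss is applied over fresh preference pairs collected from the current policy at each round, which is what mitigates the distribution shift that offline DPO suffers from.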

Key Capabilities & Performance

This model demonstrates state-of-the-art performance within its class, surpassing other 8B models like LLaMA-3-8B-it, many larger open-source models (e.g., Mixtral-8x7B-it), and even strong proprietary models such as GPT-3.5-turbo-0613 on key instruct benchmarks. Notably, it achieves:

  • 37.2 on Alpaca-Eval-V2
  • 8.46 on MT-Bench
  • 29.1 on Chat-Arena-Hard

These results are achieved using only open-sourced datasets, without reliance on additional human or GPT-4 labeling. While it excels on instruct benchmarks, its scores on academic benchmarks (e.g., GSM-8K, MMLU, HumanEval) remain competitive with other LLaMA-3-8B variants.

When to Use This Model

  • Instruction Following: Ideal for applications requiring high-quality responses to user instructions and prompts.
  • Conversational AI: Suitable for chatbots and interactive agents where strong performance on chat benchmarks is crucial.
  • Resource-Efficient Deployment: Offers competitive performance at an 8B parameter scale, making it a strong candidate for scenarios where larger models might be too resource-intensive.

Limitations

As a research model, SFR-Iterative-DPO-LLaMA-3-8B-R may still generate offensive or unethical content under adversarial conditions, despite integrated safety and ethical considerations in its alignment process. Users are encouraged to use it responsibly.

Popular Sampler Settings

The sampler configurations most used by Featherless users for this model tune the following parameters:

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p