Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 8K · Published: May 9, 2024 · License: llama3 · Architecture: Transformer

Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R is an 8 billion parameter instruct model developed by Salesforce, based on the LLaMA-3 architecture with an 8192 token context length. It is distinguished by its iterative DPO-based online RLHF training method, which enables it to outperform many larger open-source and some proprietary models on instruct benchmarks like Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. This model is optimized for general instruction following and conversational AI tasks.


Overview

The model is trained with an iterative, DPO-based online Reinforcement Learning from Human Feedback (RLHF) approach. Compared with PPO-based pipelines, this method is simpler and more computationally efficient, and its online iterations help mitigate the distribution shift that arises during policy optimization.
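To make the training objective concrete, here is a minimal sketch of the standard pairwise DPO loss that this family of methods is built on: the negative log-sigmoid of the scaled margin between the policy's and reference model's log-probability gaps on a chosen/rejected response pair. The `beta` value and the log-probabilities are illustrative inputs, not values from this model's actual training run.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss: -log sigmoid of the scaled implicit-reward margin."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy favors the chosen response more strongly than the reference
# does, the margin is positive and the loss drops below log(2) ~= 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In the iterative online variant, the policy from the previous round generates fresh response pairs that are ranked and fed back into this objective, rather than training once on a fixed offline preference dataset.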

Key Capabilities & Performance

This model demonstrates strong performance across various instruct benchmarks, often surpassing models of similar size and even some larger open-source models like Mixtral-8x7B-it, as well as proprietary models such as GPT-3.5-turbo-0613. Key benchmark results include:

  • Alpaca-Eval-V2: 31.3
  • MT-Bench: 8.46
  • Chat-Arena-Hard: 29.1

It achieves these results using only open-sourced datasets, without reliance on additional human or GPT-4 labeling. While it excels at instruct tasks, its academic benchmark scores for reasoning and coding (e.g., GSM-8K, HumanEval) remain competitive with other LLaMA-3-8B variants.

Good For

  • General instruction following and conversational AI applications.
  • Use cases requiring a highly capable 8B parameter model that performs comparably to or better than many larger alternatives on instruct benchmarks.
  • Research into efficient online RLHF methods and DPO-based training.
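For the conversational use cases above, a typical way to query an instruct model like this one is through an OpenAI-compatible chat-completions endpoint. The sketch below only assembles the request payload; the endpoint URL and API key in the trailing comment are placeholders, not real Featherless values.

```python
import json

MODEL_ID = "Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R"

def build_request(user_prompt, model=MODEL_ID, max_tokens=512):
    """Assemble an OpenAI-style chat-completion payload for the instruct model."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
    }

payload = build_request("Summarize iterative DPO in one paragraph.")
print(json.dumps(payload, indent=2))

# To actually send it (requires the `requests` package and a valid key):
# requests.post("https://example.invalid/v1/chat/completions",
#               headers={"Authorization": "Bearer <API_KEY>"},
#               json=payload)
```

Because the model uses the standard LLaMA-3 chat template, serving stacks that apply chat templates automatically (e.g., via `tokenizer.apply_chat_template` in Transformers) can consume the same `messages` structure directly.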

Popular Sampler Settings

The top 3 parameter combinations used by Featherless users for this model cover the following sampler settings:

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p