Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R
Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R is an 8-billion-parameter instruct model developed by Salesforce, based on the LLaMA-3 architecture with an 8,192-token context length. It is distinguished by its iterative, DPO-based online RLHF training recipe, which lets it outperform many larger open-source models and some proprietary models on instruct benchmarks such as Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. The model is optimized for general instruction following and conversational AI tasks.
Overview
Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R is an 8-billion-parameter instruct model developed by Salesforce. It leverages an iterative, DPO-based online Reinforcement Learning from Human Feedback (RLHF) training approach, which is noted for being simpler and more efficient than PPO-based methods and which mitigates the distribution shift that arises during policy optimization.
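To make the training objective concrete, the standard DPO loss on a single preference pair can be sketched as below. This is a minimal illustration, not the model's actual training code; in the iterative online variant, new preference pairs are collected from the current policy and the reference model is refreshed between rounds, but the per-pair loss has the same form. The function name and inputs (summed log-probabilities of each response) are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    # Implicit reward of each response: beta-scaled log-ratio
    # between the policy and the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: low when the policy
    # prefers the chosen response more strongly than the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already favors the chosen response, the loss is small;
# when it favors the rejected one, the loss grows.
low = dpo_loss(-10.0, -30.0, -12.0, -12.0)
high = dpo_loss(-30.0, -10.0, -12.0, -12.0)
```

With a margin of zero (policy and reference agree exactly), the loss is log 2; gradient descent pushes the margin positive, widening the policy's preference for the chosen response relative to the reference.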
Key Capabilities & Performance
This model demonstrates strong performance across various instruct benchmarks, often surpassing models of similar size and even some larger open-source models like Mixtral-8x7B-it, as well as proprietary models such as GPT-3.5-turbo-0613. Key benchmark results include:
- Alpaca-Eval-V2: 31.3
- MT-Bench: 8.46
- Chat-Arena-Hard: 29.1
It achieves these results using only open-sourced datasets, without reliance on additional human or GPT-4 labeling. While it excels at instruct tasks, its academic benchmark scores for reasoning and coding tasks (e.g., GSM-8K, HumanEval) remain competitive with other LLaMA-3-8B variants.
Good For
- General instruction following and conversational AI applications.
- Use cases requiring a highly capable 8B parameter model that performs comparably to or better than many larger alternatives on instruct benchmarks.
- Research into efficient online RLHF methods and DPO-based training.
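For conversational use, prompts are expected to follow the LLaMA-3 Instruct chat format. The sketch below builds such a prompt string by hand purely for illustration, assuming the model uses the standard LLaMA-3 chat template; in practice, `tokenizer.apply_chat_template` from the `transformers` library is the authoritative way to format inputs, since it reads the template shipped with the model.

```python
def format_llama3_prompt(messages):
    """Render a chat history as a LLaMA-3-Instruct-style prompt string.

    Illustrative only: assumes the standard LLaMA-3 special tokens
    (<|begin_of_text|>, <|start_header_id|>, <|eot_id|>). Prefer the
    tokenizer's own chat template in real code.
    """
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Open an assistant header so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_prompt([
    {"role": "user", "content": "Explain DPO in one sentence."},
])
```

The trailing assistant header is what cues the model to produce its turn; generation is typically stopped at the `<|eot_id|>` token.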