Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R
Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R is an 8-billion-parameter instruct model developed by Salesforce, based on the LLaMA-3 architecture with an 8,192-token context length. It is distinguished by its iterative, DPO-based online RLHF training recipe, which lets it outperform many larger open-source models and some proprietary models on instruct benchmarks such as Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. The model is optimized for general instruction following and conversational AI tasks.
Overview
Salesforce/LLaMA-3-8B-SFR-Iterative-DPO-R is an 8-billion-parameter instruct model developed by Salesforce. It leverages an iterative, DPO-based online Reinforcement Learning from Human Feedback (RLHF) training approach, which is noted for being simpler and more efficient than PPO-based methods and which mitigates the distribution shift that arises during policy optimization.
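To make the training objective concrete, the standard DPO loss on a single preference pair can be sketched as below. This is a minimal illustration, not the model's actual training code; in the iterative online variant, new preference pairs are collected from the current policy and the reference model is refreshed between rounds, but the per-pair loss has the same form. The function name and inputs (summed log-probabilities of each response) are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    """
    # Implicit reward of each response: beta-scaled log-ratio
    # between the policy and the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: low when the policy
    # prefers the chosen response more strongly than the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already favors the chosen response, the loss is small;
# when it favors the rejected one, the loss grows.
low = dpo_loss(-10.0, -30.0, -12.0, -12.0)
high = dpo_loss(-30.0, -10.0, -12.0, -12.0)
```

With a margin of zero (policy and reference agree exactly), the loss is log 2; gradient descent pushes the margin positive, widening the policy's preference for the chosen response relative to the reference.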
Key Capabilities & Performance
This model demonstrates strong performance across various instruct benchmarks, often surpassing models of similar size and even some larger open-source models like Mixtral-8x7B-it, as well as proprietary models such as GPT-3.5-turbo-0613. Key benchmark results include:
- Alpaca-Eval-V2: 31.3
- MT-Bench: 8.46
- Chat-Arena-Hard: 29.1
It achieves these results using only open-sourced datasets, without reliance on additional human or GPT-4 labeling. While it excels at instruct tasks, its academic benchmark scores for reasoning and coding tasks (e.g., GSM-8K, HumanEval) remain competitive with other LLaMA-3-8B variants.
Good For
- General instruction following and conversational AI applications.
- Use cases requiring a highly capable 8B parameter model that performs comparably to or better than many larger alternatives on instruct benchmarks.
- Research into efficient online RLHF methods and DPO-based training.
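For conversational use, prompts are expected to follow the LLaMA-3 Instruct chat format. The sketch below builds such a prompt string by hand purely for illustration, assuming the model uses the standard LLaMA-3 chat template; in practice, `tokenizer.apply_chat_template` from the `transformers` library is the authoritative way to format inputs, since it reads the template shipped with the model.

```python
def format_llama3_prompt(messages):
    """Render a chat history as a LLaMA-3-Instruct-style prompt string.

    Illustrative only: assumes the standard LLaMA-3 special tokens
    (<|begin_of_text|>, <|start_header_id|>, <|eot_id|>). Prefer the
    tokenizer's own chat template in real code.
    """
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Open an assistant header so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_prompt([
    {"role": "user", "content": "Explain DPO in one sentence."},
])
```

The trailing assistant header is what cues the model to produce its turn; generation is typically stopped at the `<|eot_id|>` token.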