TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-8B-R
SFR-Iterative-DPO-LLaMA-3-8B-R is an 8-billion-parameter instruct model developed by Salesforce, based on the LLaMA-3 architecture with an 8192-token context length. It is trained with an iterative, DPO-based online RLHF method, which lets it outperform models of similar size, as well as many larger open-source and proprietary models, on instruct benchmarks such as Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. The model is optimized for instruction following and general conversational AI tasks, and achieves this performance without relying on additional human or GPT-4 labeling.
Model Overview
SFR-Iterative-DPO-LLaMA-3-8B-R is an 8 billion parameter instruct model from Salesforce, built upon the LLaMA-3 architecture. It distinguishes itself through an innovative iterative DPO (Direct Preference Optimization) based online RLHF (Reinforcement Learning from Human Feedback) training approach. This method is designed to be more efficient and simpler than PPO-based alternatives, while effectively mitigating distribution shifts during policy optimization.
Key Capabilities & Performance
This model demonstrates state-of-the-art performance within its class, surpassing other 8B models like LLaMA-3-8B-it, many larger open-source models (e.g., Mixtral-8x7B-it), and even strong proprietary models such as GPT-3.5-turbo-0613 on key instruct benchmarks. Notably, it achieves:
- 37.2 on Alpaca-Eval-V2
- 8.46 on MT-Bench
- 29.1 on Chat-Arena-Hard
These results are achieved using only open-sourced datasets, without reliance on additional human or GPT-4 labeling. While excelling in instruct benchmarks, its academic benchmark scores (e.g., GSM-8K, MMLU, HumanEval) are competitive with other LLaMA-3-8B variants.
When to Use This Model
- Instruction Following: Ideal for applications requiring high-quality responses to user instructions and prompts.
- Conversational AI: Suitable for chatbots and interactive agents where strong performance on chat benchmarks is crucial.
- Resource-Efficient Deployment: Offers competitive performance at an 8B parameter scale, making it a strong candidate for scenarios where larger models might be too resource-intensive.
Limitations
As a research model, SFR-Iterative-DPO-LLaMA-3-8B-R may still generate offensive or unethical content under adversarial conditions, despite integrated safety and ethical considerations in its alignment process. Users are encouraged to use it responsibly.