RLHFlow/LLaMA3-iterative-DPO-final
RLHFlow/LLaMA3-iterative-DPO-final is an 8 billion parameter LLaMA3-based instruct model developed by RLHFlow, fine-tuned with an iterative DPO-based online RLHF recipe. It significantly outperforms models of similar size, many larger open-source models, and strong proprietary models such as GPT-3.5-turbo-0613 on chat benchmarks including Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. It is optimized for instruction following and general conversational AI tasks, and achieves this performance without relying on additional human or GPT-4 labeling.
Overview
RLHFlow/LLaMA3-iterative-DPO-final is an 8 billion parameter instruction-tuned model based on the LLaMA3 architecture. Developed by RLHFlow, the model uses a novel iterative DPO-based online RLHF (Reinforcement Learning from Human Feedback) recipe, detailed in their TMLR 2024 paper "RLHF Workflow: From Reward Modeling to Online RLHF". A key differentiator is the training methodology: an online component mitigates the distribution shift between the policy and its preference data, making the recipe more efficient and simpler to train than PPO-based approaches.
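At the core of each iteration is the standard DPO objective: a pairwise logistic loss over chosen/rejected responses, computed against a frozen reference model. A minimal, illustrative sketch of that loss in pure Python (not the authors' implementation; the log-probability values below are made-up numbers):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the total log-probability of a response under the
    current policy or the frozen reference model; beta controls how far
    the policy is allowed to drift from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the scaled margin: the loss shrinks as
    # the policy's preference for the chosen response grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A pair the policy already ranks correctly incurs a lower loss ...
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# ... than a pair it ranks the wrong way around.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

In the online variant, new response pairs are sampled from the current policy and labeled by a reward model at each iteration, so the preference data tracks the policy's own distribution rather than a fixed offline dataset.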
Key Capabilities & Performance
This model demonstrates state-of-the-art performance within its class across major instruct model benchmarks:
- Alpaca-Eval-V2: Achieves 37.2, outperforming LLaMA-3-8B-it (22.9), Mixtral-8x7B-it (23.7), and GPT-3.5-turbo-0613 (22.7).
- MT-Bench: Scores 8.46, surpassing LLaMA-3-8B-it (8.16) and GPT-3.5-turbo-0613 (8.39).
- Chat-Arena-Hard: Reaches 29.1, significantly higher than LLaMA-3-8B-it (20.6) and GPT-3.5-turbo-0613 (24.8).
Notably, these results are achieved using only open-source datasets, without additional human or GPT-4 labeling. While the model excels on chat benchmarks, its academic benchmark performance (e.g., GSM-8K, MMLU) is competitive with LLaMA-3-8B-it.
When to Use This Model
- Instruction Following: Ideal for applications requiring high-quality responses to diverse instructions.
- General Conversational AI: Suitable for chatbots and interactive agents where strong benchmark performance is critical.
- Research in RLHF: Valuable for researchers interested in DPO-based online RLHF methods and their practical application, with a detailed reproduction recipe available.
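As a LLaMA-3 instruct model, it expects prompts in the LLaMA-3 chat format. In practice `tokenizer.apply_chat_template` from `transformers` handles this for you; the plain-Python sketch below illustrates the assumed template (based on the general LLaMA-3 instruct format, not taken from this model card):

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Render a chat history into the LLaMA-3 instruct prompt format.

    messages: list of {"role": ..., "content": ...} dicts, the same
    shape accepted by transformers' chat-template APIs.
    """
    prompt = "<|begin_of_text|>"
    for msg in messages:
        # Each turn is wrapped in role headers and closed with <|eot_id|>.
        prompt += (f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
                   f"{msg['content']}<|eot_id|>")
    if add_generation_prompt:
        # Leave the prompt open at an assistant turn so the model
        # generates the reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

prompt = format_llama3_chat([
    {"role": "user", "content": "What is iterative DPO?"},
])
```

For real inference, pass the same `messages` list to `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` with the tokenizer loaded from `RLHFlow/LLaMA3-iterative-DPO-final`, rather than formatting prompts by hand.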