RLHFlow/LLaMA3-iterative-DPO-final

  • Task: text generation
  • Concurrency cost: 1
  • Model size: 8B
  • Quantization: FP8
  • Context length: 8k
  • Published: May 17, 2024
  • License: llama3
  • Architecture: Transformer

RLHFlow/LLaMA3-iterative-DPO-final is an 8 billion parameter LLaMA3-based instruct model developed by RLHFlow, fine-tuned with an iterative DPO-based online RLHF recipe. It outperforms models of similar size, many larger open-source models, and strong proprietary models such as GPT-3.5-turbo-0613 on chat benchmarks including Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. The model is optimized for instruction following and general conversational AI, and it reaches these results without any additional human or GPT-4 labeling.


Overview

RLHFlow/LLaMA3-iterative-DPO-final is an 8 billion parameter instruction-tuned model based on the LLaMA3 architecture. Developed by RLHFlow, it was trained with an iterative DPO-based online RLHF (reinforcement learning from human feedback) recipe, detailed in the TMLR 2024 paper "RLHF Workflow: From Reward Modeling to Online RLHF". The key differentiator is the online component of training, which mitigates the distribution shift that affects purely offline preference optimization while remaining simpler and cheaper to train than PPO-based approaches.
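The model ships in the standard Hugging Face format and uses the LLaMA-3 chat template, so it can be loaded with the `transformers` library. The snippet below is a minimal sketch; the sampling values are illustrative placeholders, not values recommended by RLHFlow.

```python
# Minimal chat sketch with Hugging Face transformers.
# Sampling values are illustrative, not an official recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-iterative-DPO-final"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,  # placeholder value
    top_p=0.9,        # placeholder value
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```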

Key Capabilities & Performance

This model demonstrates state-of-the-art performance within its class across major instruct model benchmarks:

  • Alpaca-Eval-V2: Achieves 37.2, outperforming LLaMA-3-8B-it (22.9), Mixtral-8x7B-it (23.7), and GPT-3.5-turbo-0613 (22.7).
  • MT-Bench: Scores 8.46, surpassing LLaMA-3-8B-it (8.16) and GPT-3.5-turbo-0613 (8.39).
  • Chat-Arena-Hard: Reaches 29.1, significantly higher than LLaMA-3-8B-it (20.6) and GPT-3.5-turbo-0613 (24.8).

Notably, these results are achieved using only open-source datasets, without reliance on additional human or GPT-4 labeling. While the model excels on chat benchmarks, its academic benchmark performance (e.g., GSM-8K, MMLU) is competitive with LLaMA-3-8B-it rather than ahead of it.

When to Use This Model

  • Instruction Following: Ideal for applications requiring high-quality responses to diverse instructions.
  • General Conversational AI: Suitable for chatbots and interactive agents where strong benchmark performance is critical.
  • Research in RLHF: Valuable for researchers interested in DPO-based online RLHF methods and their practical application; a detailed reproduction recipe is available, and a sketch of the core DPO objective follows this list.
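For orientation, the core of each DPO training round is the standard pairwise preference loss. The sketch below is a generic implementation of the DPO objective (Rafailov et al., 2023), not RLHFlow's exact training code; the tensor names and the beta value are illustrative.

```python
# Generic DPO loss sketch (standard objective, not RLHFlow's exact code).
# Inputs are summed log-probabilities of the chosen/rejected responses
# under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # illustrative value
) -> torch.Tensor:
    # Implicit rewards measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the margin: push chosen responses above rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the iterative/online variant described in the paper, each round the current policy generates fresh responses, a reward model ranks them into new preference pairs, and the DPO update is re-run on that data, which is what mitigates the distribution shift of purely offline DPO.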

Popular Sampler Settings

The top 3 parameter combinations used by Featherless users for this model tune the following sampler parameters (an illustrative API sketch follows the list):

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p
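As a rough illustration, the sketch below passes such sampler settings through an OpenAI-compatible chat completions client. The endpoint URL, API-key variable, and all parameter values are placeholders, not the actual top-3 Featherless configs; non-standard parameters such as top_k, repetition_penalty, and min_p are sent via extra_body, which many OpenAI-compatible servers accept but which is not guaranteed to be supported.

```python
# Hypothetical sketch: sampler settings via an OpenAI-compatible client.
# All values are placeholders, not the actual Featherless top-3 configs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key=os.environ["FEATHERLESS_API_KEY"],  # hypothetical env variable
)

response = client.chat.completions.create(
    model="RLHFlow/LLaMA3-iterative-DPO-final",
    messages=[{"role": "user", "content": "Write a haiku about spring."}],
    temperature=0.8,        # placeholder value
    top_p=0.95,             # placeholder value
    frequency_penalty=0.0,  # placeholder value
    presence_penalty=0.0,   # placeholder value
    extra_body={            # non-standard params; server support varies
        "top_k": 40,
        "repetition_penalty": 1.1,
        "min_p": 0.05,
    },
)
print(response.choices[0].message.content)
```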