RLHFlow/LLaMA3-iterative-DPO-final

  • Task: text generation
  • Concurrency cost: 1
  • Model size: 8B
  • Quantization: FP8
  • Context length: 8k
  • Published: May 17, 2024
  • License: llama3
  • Architecture: Transformer

RLHFlow/LLaMA3-iterative-DPO-final is an 8 billion parameter LLaMA3-based instruct model developed by RLHFlow, fine-tuned with an iterative DPO-based online RLHF recipe. It outperforms models of similar size, many larger open-source models, and strong proprietary models such as GPT-3.5-turbo-0613 on chat benchmarks including Alpaca-Eval-V2, MT-Bench, and Chat-Arena-Hard. The model is optimized for instruction following and general conversational AI, and it reaches these results without any additional human or GPT-4 labeling.


Overview

RLHFlow/LLaMA3-iterative-DPO-final is an 8 billion parameter instruction-tuned model based on the LLaMA3 architecture. Developed by RLHFlow, it was trained with an iterative DPO-based online RLHF (reinforcement learning from human feedback) recipe, detailed in the TMLR 2024 paper "RLHF Workflow: From Reward Modeling to Online RLHF". The key differentiator is the online component of training, which mitigates the distribution shift that affects purely offline preference optimization while remaining simpler and cheaper to train than PPO-based approaches.
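The model ships in the standard Hugging Face format and uses the LLaMA-3 chat template, so it can be loaded with the `transformers` library. The snippet below is a minimal sketch; the sampling values are illustrative placeholders, not values recommended by RLHFlow.

```python
# Minimal chat sketch with Hugging Face transformers.
# Sampling values are illustrative, not an official recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-iterative-DPO-final"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,  # placeholder value
    top_p=0.9,        # placeholder value
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```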

Key Capabilities & Performance

This model demonstrates state-of-the-art performance within its class across major instruct model benchmarks:

  • Alpaca-Eval-V2: Achieves 37.2, outperforming LLaMA-3-8B-it (22.9), Mixtral-8x7B-it (23.7), and GPT-3.5-turbo-0613 (22.7).
  • MT-Bench: Scores 8.46, surpassing LLaMA-3-8B-it (8.16) and GPT-3.5-turbo-0613 (8.39).
  • Chat-Arena-Hard: Reaches 29.1, significantly higher than LLaMA-3-8B-it (20.6) and GPT-3.5-turbo-0613 (24.8).

Notably, these results are achieved using only open-source datasets, without reliance on additional human or GPT-4 labeling. While the model excels on chat benchmarks, its academic benchmark performance (e.g., GSM-8K, MMLU) is competitive with LLaMA-3-8B-it rather than ahead of it.

When to Use This Model

  • Instruction Following: Ideal for applications requiring high-quality responses to diverse instructions.
  • General Conversational AI: Suitable for chatbots and interactive agents where strong benchmark performance is critical.
  • Research in RLHF: Valuable for researchers interested in DPO-based online RLHF methods and their practical application; a detailed reproduction recipe is available, and a sketch of the core DPO objective follows this list.
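For orientation, the core of each DPO training round is the standard pairwise preference loss. The sketch below is a generic implementation of the DPO objective (Rafailov et al., 2023), not RLHFlow's exact training code; the tensor names and the beta value are illustrative.

```python
# Generic DPO loss sketch (standard objective, not RLHFlow's exact code).
# Inputs are summed log-probabilities of the chosen/rejected responses
# under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # illustrative value
) -> torch.Tensor:
    # Implicit rewards measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the margin: push chosen responses above rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the iterative/online variant described in the paper, each round the current policy generates fresh responses, a reward model ranks them into new preference pairs, and the DPO update is re-run on that data, which is what mitigates the distribution shift of purely offline DPO.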

Popular Sampler Settings

The top 3 parameter combinations used by Featherless users for this model tune the following sampler parameters (an illustrative API sketch follows the list):

  • temperature
  • top_p
  • top_k
  • frequency_penalty
  • presence_penalty
  • repetition_penalty
  • min_p
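As a rough illustration, the sketch below passes such sampler settings through an OpenAI-compatible chat completions client. The endpoint URL, API-key variable, and all parameter values are placeholders, not the actual top-3 Featherless configs; non-standard parameters such as top_k, repetition_penalty, and min_p are sent via extra_body, which many OpenAI-compatible servers accept but which is not guaranteed to be supported.

```python
# Hypothetical sketch: sampler settings via an OpenAI-compatible client.
# All values are placeholders, not the actual Featherless top-3 configs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key=os.environ["FEATHERLESS_API_KEY"],  # hypothetical env variable
)

response = client.chat.completions.create(
    model="RLHFlow/LLaMA3-iterative-DPO-final",
    messages=[{"role": "user", "content": "Write a haiku about spring."}],
    temperature=0.8,        # placeholder value
    top_p=0.95,             # placeholder value
    frequency_penalty=0.0,  # placeholder value
    presence_penalty=0.0,   # placeholder value
    extra_body={            # non-standard params; server support varies
        "top_k": 40,
        "repetition_penalty": 1.1,
        "min_p": 0.05,
    },
)
print(response.choices[0].message.content)
```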