RLHFlow/LLaMA3-SFT

Hosted on Hugging Face · Text generation · Model size: 8B · Quantization: FP8 · Context length: 8K · Concurrency cost: 1 · Architecture: Transformer · Published: May 17, 2024

RLHFlow/LLaMA3-SFT is an 8 billion parameter SFT (Supervised Fine-Tuning) checkpoint derived from Meta-Llama-3-8B, developed by a team including Hanze Dong and Wei Xiong. This model is specifically designed as a strong baseline for RLHF research, having been fine-tuned on a diverse mixture of high-quality open-source data. It serves as a foundational model for further reinforcement learning applications, offering solid performance across various benchmarks before any RLHF training.


RLHFlow/LLaMA3-SFT: A Strong SFT Baseline for RLHF Research

This model is an 8 billion parameter Supervised Fine-Tuning (SFT) checkpoint, originating from meta-llama/Meta-Llama-3-8B. It was developed by a research team including Hanze Dong and Wei Xiong, as part of the RLHFlow/Online-RLHF project, detailed in their TMLR 2024 paper, "RLHF Workflow: From Reward Modeling to Online RLHF".
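The quickest way to try the checkpoint is through Hugging Face transformers. Below is a minimal sketch, assuming a CUDA-capable machine and that the tokenizer ships a chat template; the prompt and generation settings are illustrative, not taken from the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-SFT"

# Load the tokenizer and the SFT checkpoint (bfloat16 keeps the 8B model within a single modern GPU).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The SFT mixture is conversational, so format the prompt with the chat template.
messages = [{"role": "user", "content": "Explain what an SFT checkpoint is in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a short completion with mild sampling (values are illustrative).
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```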

Key Capabilities & Characteristics

  • Foundation for RLHF: Designed specifically as a robust starting point for Reinforcement Learning from Human Feedback (RLHF) research, without having undergone RLHF training itself.
  • Diverse Data Training: Fine-tuned for one epoch on a mixture of diverse, high-quality open-source datasets, ensuring a broad understanding of various tasks.
  • Solid Baseline Performance: Achieves competitive scores in a zero-shot setting across academic benchmarks, including:
    • GSM-8K: 74.2
    • HumanEval: 64.6
    • TruthfulQA: 63.4
    • ARC: 53.5
    • MBPP: 58.6

Good For

  • RLHF Experimentation: Ideal for researchers and developers looking for a strong, pre-trained SFT model to build on in their RLHF pipelines and experiments (see the sketch after this list).
  • General Language Understanding: Its training on diverse datasets makes it suitable for a wide range of general language understanding and generation tasks.
  • Benchmarking: Can be used as a reliable baseline to compare the performance improvements gained from subsequent RLHF stages or other fine-tuning methods.
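As one concrete illustration of the RLHF-experimentation use case, the sketch below wires the checkpoint into a direct preference optimization (DPO) run with Hugging Face TRL. This is a hedged outline, not the authors' pipeline: the preference dataset, the hyperparameters, and the exact TRL argument names (which vary between TRL releases) are assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "RLHFlow/LLaMA3-SFT"

# The SFT checkpoint becomes the starting policy; TRL keeps a frozen copy as the implicit reference model.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Any preference dataset with "prompt"/"chosen"/"rejected" columns works; this one is only an example.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Hyperparameters are placeholders, not values from the RLHFlow paper.
config = DPOConfig(
    output_dir="llama3-sft-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use `tokenizer=` instead
)
trainer.train()
```

The same starting checkpoint can be swapped into other preference-optimization or online RLHF recipes; only the trainer and dataset change.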

Popular Sampler Settings

Featherless surfaces the three most popular sampler configurations used for this model, covering the following parameters: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p. A hedged example of applying such settings locally follows below.
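For local experimentation, most of these knobs map directly onto `model.generate` keyword arguments in transformers; frequency_penalty and presence_penalty are OpenAI-style API fields and have no direct transformers equivalent. The values below are illustrative defaults, not the Featherless presets (which are not reproduced here). The snippet continues from the loading sketch above, so `model`, `tokenizer`, and `inputs` are assumed to exist.

```python
# Sampling configuration sketch; every value here is illustrative.
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,          # enable sampling so the knobs below take effect
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    min_p=0.05,              # requires a recent transformers release
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```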