RLHFlow/LLaMA3-SFT-v2

RLHFlow/LLaMA3-SFT-v2 Overview

RLHFlow/LLaMA3-SFT-v2 is an 8-billion-parameter supervised fine-tuned (SFT) model derived from meta-llama/Meta-Llama-3-8B. It is the SFT checkpoint used in the RLHFlow/Online-RLHF project, which implements an online iterative RLHF workflow. The model was trained for 2 epochs on RLHFlow/RLHFlow-SFT-Dataset-ver2 with a global batch size of 128 and a learning rate of 2e-5, with training samples packed into 8192-token blocks.
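
The snippet below is a minimal generation sketch using Hugging Face transformers. It assumes the checkpoint's tokenizer ships a Llama-3-style chat template and that bfloat16 inference fits on the available hardware; the prompt and decoding settings are illustrative only.

```python
# Minimal generation sketch; prompt and decoding settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-SFT-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumes the tokenizer provides a Llama-3-style chat template.
messages = [{"role": "user", "content": "Solve 17 * 24 step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```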

Key Capabilities & Performance

This model shows notable improvements over the official LLaMA-3-8B-it instruct model and over the earlier RLHFlow/LLaMA3-SFT release, particularly in:

  • Mathematical Reasoning: Achieves 83.4 on GSM-8K and 41.1 on MATH, significantly outperforming LLaMA-3-8B-it (79.6 GSM-8K, 26.3 MATH) and RLHFlow/LLaMA3-SFT (74.2 GSM-8K, 30.0 MATH).
  • Code Generation: Scores 66.5 on HumanEval, surpassing LLaMA-3-8B-it (61.6) and RLHFlow/LLaMA3-SFT (63.4).
  • General Knowledge: Maintains strong performance on MMLU (64.8) and ARC (60.0).

Intended Use Cases

RLHFlow/LLaMA3-SFT-v2 is particularly well-suited for:

  • Mathematical problem-solving and quantitative reasoning.
  • Code generation and understanding tasks.
  • Serving as a foundational SFT checkpoint for further Reinforcement Learning from Human Feedback (RLHF) research and development, especially within the context of the RLHF Workflow paper (a hedged sketch of this use follows the list).
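
Because the checkpoint is intended as a starting policy for preference optimization, the sketch below shows one way it could be used. It relies on Hugging Face TRL's DPOTrainer as a stand-in for the project's own online-RLHF pipeline; the dataset name, hyperparameters, and the processing_class argument name (older TRL releases call it tokenizer) are assumptions, not values from the RLHFlow/Online-RLHF repository.

```python
# Hypothetical sketch: initializing a DPO run from RLHFlow/LLaMA3-SFT-v2 with TRL.
# Dataset name and hyperparameters are placeholders, not RLHFlow's actual settings.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "RLHFlow/LLaMA3-SFT-v2"
policy = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Any preference dataset with "prompt", "chosen", and "rejected" columns works here;
# this dataset identifier is illustrative only.
pref_data = load_dataset("your-org/your-preference-dataset", split="train")

config = DPOConfig(
    output_dir="llama3-sft-v2-dpo",
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    beta=0.1,  # strength of the implicit KL penalty toward the SFT policy
)

trainer = DPOTrainer(
    model=policy,        # policy initialized from the SFT checkpoint
    ref_model=None,      # TRL builds the frozen reference from a copy of the policy
    args=config,
    train_dataset=pref_data,
    processing_class=tokenizer,  # older TRL versions name this argument `tokenizer`
)
trainer.train()
```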