RLHFlow/LLaMA3-SFT-v2 Overview
RLHFlow/LLaMA3-SFT-v2 is an 8 billion parameter supervised fine-tuned (SFT) model derived from meta-llama/Meta-Llama-3-8B. It serves as a key component in the RLHFlow/Online-RLHF project, which implements online iterative RLHF workflows. The model was trained for 2 epochs on the RLHFlow/RLHFlow-SFT-Dataset-ver2 with a global batch size of 128 and a learning rate of 2e-5, with samples packed into blocks of 8192 tokens.
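The exact training scripts live in the RLHFlow/Online-RLHF repository; purely to illustrate the hyperparameters above, a minimal sketch with TRL's SFTTrainer could look like the following. The per-device batch size, gradient-accumulation split, packing flag, and bf16 setting are assumptions for illustration, not the authors' actual configuration.

```python
# Hedged sketch: mirrors the reported hyperparameters (2 epochs, global batch 128,
# lr 2e-5, 8192-token blocks) using TRL's SFTTrainer. The per-device/accumulation
# split and packing/bf16 flags are illustrative assumptions, not the exact setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("RLHFlow/RLHFlow-SFT-Dataset-ver2", split="train")

config = SFTConfig(
    output_dir="llama3-sft-v2",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=8,   # e.g. 8 GPUs x 8 x grad_accum 2 = global batch 128 (assumed split)
    gradient_accumulation_steps=2,
    max_seq_length=8192,             # samples packed into 8192-token blocks
    packing=True,
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",
    train_dataset=dataset,
    args=config,
)
trainer.train()
```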
Key Capabilities & Performance
This model shows notable improvements over the instruction-tuned LLaMA-3-8B-it and the previous RLHFlow/LLaMA3-SFT checkpoint, particularly in the areas below (a sketch for re-running some of these benchmarks follows the list):
- Mathematical Reasoning: Achieves 83.4 on GSM-8K and 41.1 on MATH, significantly outperforming LLaMA-3-8B-it (79.6 GSM-8K, 26.3 MATH) and RLHFlow/LLaMA3-SFT (74.2 GSM-8K, 30.0 MATH).
- Code Generation: Scores 66.5 on HumanEval, surpassing LLaMA-3-8B-it (61.6) and RLHFlow/LLaMA3-SFT (63.4).
- General Knowledge: Maintains strong performance on MMLU (64.8) and ARC (60.0).
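The exact evaluation protocol behind these numbers is not specified in this card. As an independent check, one could re-run a subset of the benchmarks with EleutherAI's lm-evaluation-harness, as in the sketch below; the task names, few-shot counts, and batch size are assumptions and may not match the protocol used for the reported scores.

```python
# Hedged sketch: independently re-running a subset of the benchmarks with
# EleutherAI's lm-evaluation-harness. Task names and few-shot counts are
# assumptions and may not match the protocol behind the numbers above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=RLHFlow/LLaMA3-SFT-v2,dtype=bfloat16",
    tasks=["gsm8k", "arc_challenge", "mmlu"],
    num_fewshot=5,      # assumed; the card does not state shot counts
    batch_size=8,
)
print(results["results"])
```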
Intended Use Cases
RLHFlow/LLaMA3-SFT-v2 is particularly well-suited for the following (a minimal chat inference sketch follows the list):
- Mathematical problem-solving and quantitative reasoning.
- Code generation and understanding tasks.
- Serving as a foundational SFT checkpoint for further Reinforcement Learning from Human Feedback (RLHF) research and development, especially within the online iterative pipeline described in the RLHF Workflow paper.
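For direct use, a standard transformers chat setup should work, assuming the model's tokenizer ships a chat template; the prompt and generation settings below are illustrative only.

```python
# Hedged sketch: standard transformers chat inference. Assumes the tokenizer
# provides a chat template; generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/LLaMA3-SFT-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```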