RLHFlow/Llama3.1-8B-PRM-Deepseek-Data

Warm
Public
8B
FP8
32768
Nov 8, 2024
Hugging Face
Overview

RLHFlow/Llama3.1-8B-PRM-Deepseek-Data is an 8 billion parameter process-supervised reward model (PRM) developed by RLHFlow. It is fine-tuned from the meta-llama/Llama-3.1-8B-Instruct base model on the RLHFlow/Deepseek-PRM-Data dataset for one epoch. Training used a global batch size of 32 and a learning rate of 2e-6, with samples packed and chunked into blocks of 8192 tokens.
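Since the PRM is a fine-tuned Llama 3.1 causal language model, it can be loaded with the standard Hugging Face transformers classes. The snippet below is a minimal sketch; the dtype and device placement are illustrative choices, not requirements from the card.

```python
# Minimal sketch: loading the PRM with Hugging Face transformers.
# The model id comes from this card; dtype/device settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference; FP8 serving is provider-specific
    device_map="auto",
)
model.eval()
```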

Key Capabilities

This model is designed to provide process-supervised rewards, primarily for mathematical reasoning tasks. Its core strength is evaluating the step-by-step reasoning of other language models rather than only their final answers, which is crucial for improving generative models on complex, multi-step problems. A hedged scoring sketch follows below.
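PRMs in this RLHFlow family are commonly queried by feeding each reasoning step as a user turn and reading the probability the model assigns to a "+" token in the assistant reply as that step's reward. The exact chat format below follows that convention and is an assumption, not a specification from this card; `score_steps` is a hypothetical helper name.

```python
# Hedged sketch of step-level scoring: each step is appended as a user turn,
# and P("+") at the assistant position is taken as the step reward.
import torch

def score_steps(question, steps, model, tokenizer):
    plus_id = tokenizer.encode("+", add_special_tokens=False)[-1]
    minus_id = tokenizer.encode("-", add_special_tokens=False)[-1]
    messages, rewards = [], []
    for i, step in enumerate(steps):
        # Assumption: the question is prepended to the first step only.
        content = f"{question}\n\n{step}" if i == 0 else step
        messages.append({"role": "user", "content": content})
        input_ids = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]
        probs = torch.softmax(logits[[plus_id, minus_id]], dim=-1)
        rewards.append(probs[0].item())  # P("+") as the reward for this step
        messages.append({"role": "assistant", "content": "+"})
    return rewards
```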

Performance Highlights

The model demonstrates significant improvements on mathematical benchmarks when used as a PRM to select among sampled solutions. For instance, with a Mistral-7B generator it lifts GSM8K to 92.4% and MATH to 46.3% (Mistral-PRM@1024). With a Deepseek-7B generator, it achieves 93.0% on GSM8K and 58.1% on MATH (Deepseek-PRM@1024), outperforming simple Pass@1 and Majority Voting baselines. These results highlight its effectiveness in guiding and refining mathematical problem-solving.
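The PRM@N setting corresponds to best-of-N selection: a separate generator samples N candidate solutions, the PRM scores each one step by step, and the highest-scoring candidate is kept. The sketch below illustrates this loop; the step delimiter and the min-over-steps aggregation are common choices but are assumptions here, and it reuses the hypothetical `score_steps` helper from above.

```python
# Illustrative best-of-N (PRM@N) selection over pre-generated candidate solutions.
def select_best(question, candidate_solutions, model, tokenizer):
    best, best_score = None, float("-inf")
    for solution in candidate_solutions:
        steps = solution.split("\n\n")   # assumption: steps separated by blank lines
        rewards = score_steps(question, steps, model, tokenizer)
        score = min(rewards)             # assumption: aggregate by the weakest step
        if score > best_score:
            best, best_score = solution, score
    return best, best_score
```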

Good For

  • Evaluating mathematical reasoning: Ideal for assessing the correctness and quality of intermediate steps in mathematical solutions.
  • Improving LLM performance on math tasks: Can be integrated into reinforcement learning from human feedback (RLHF) pipelines to enhance the mathematical capabilities of generative models.
  • Research in process-supervised reward modeling: Provides a strong baseline for further research into training and applying PRMs for complex reasoning.