RLHFlow/Llama3.1-8B-PRM-Deepseek-Data

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Nov 8, 2024 · Architecture: Transformer

RLHFlow/Llama3.1-8B-PRM-Deepseek-Data is an 8 billion parameter process-supervised reward model, fine-tuned from Meta's Llama-3.1-8B-Instruct. Developed by RLHFlow, the model is trained on the Deepseek-PRM-Data dataset with a 32,768-token context length and specializes in evaluating and improving mathematical reasoning. It demonstrates strong performance on mathematical problem-solving benchmarks such as GSM8K and MATH, particularly when used for process-supervised reward modeling.


Overview

RLHFlow/Llama3.1-8B-PRM-Deepseek-Data is an 8 billion parameter process-supervised reward model (PRM) developed by RLHFlow. It is fine-tuned from the meta-llama/Llama-3.1-8B-Instruct base model on the RLHFlow/Deepseek-PRM-Data dataset for one epoch. Training used a global batch size of 32 and a learning rate of 2e-6, with samples packed and chunked into 8192-token blocks.
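The packing-and-chunking step mentioned above can be sketched as follows. This is a minimal illustration (not RLHFlow's actual training code): token-id sequences are concatenated into one stream and split into fixed 8192-token chunks, with any trailing partial chunk dropped.

```python
CHUNK_LEN = 8192  # chunk size stated in the model card

def pack_and_chunk(sequences, chunk_len=CHUNK_LEN):
    """Concatenate token-id sequences, then split into fixed-size chunks.

    Dropping the final partial chunk is a common convention in packed
    training pipelines; the exact handling here is an assumption.
    """
    stream = [tok for seq in sequences for tok in seq]
    n_full = len(stream) // chunk_len
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]
```

Packing keeps every training batch densely filled with tokens, avoiding the padding waste that variable-length samples would otherwise cause.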

Key Capabilities

This model is designed to provide process-supervised rewards, primarily for mathematical reasoning tasks. Its core strength lies in evaluating the step-by-step reasoning process of other language models, rather than just the final answer. This capability is crucial for improving the performance of generative models on complex problems.
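In RLHFlow's PRM setup, the model is queried after each reasoning step and emits a judgment token, with the step reward taken as the probability of a "correct" label ("+") relative to an "incorrect" one ("-"). A minimal sketch of that reward computation, given the two logits at the judgment position (the token choice and two-way softmax are the assumed convention here):

```python
import math

def step_reward(logit_plus, logit_minus):
    """Reward for one reasoning step: softmax probability of the
    '+' (step-correct) token over the '+'/'-' pair.

    The logits would come from a forward pass of the PRM at the
    position where it renders its judgment; here they are inputs.
    """
    m = max(logit_plus, logit_minus)  # subtract max for numerical stability
    e_plus = math.exp(logit_plus - m)
    e_minus = math.exp(logit_minus - m)
    return e_plus / (e_plus + e_minus)
```

Scoring each intermediate step this way is what distinguishes a process-supervised reward model from an outcome reward model, which scores only the final answer.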

Performance Highlights

The model demonstrates significant improvements in mathematical benchmarks when applied as a PRM. For instance, when used with a Mistral-7B generator, it boosts GSM8K performance to 92.4% and MATH to 46.3% (Mistral-PRM@1024). With a Deepseek-7B generator, it achieves 93.0% on GSM8K and 58.1% on MATH (Deepseek-PRM@1024), outperforming simple Pass@1 or Majority Voting methods. These results highlight its effectiveness in guiding and refining mathematical problem-solving.
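PRM@1024 in these results means sampling 1024 candidate solutions from the generator and keeping the one the PRM scores highest. A minimal best-of-N reranking sketch, assuming per-step rewards are aggregated by taking their minimum (one choice among several used in the PRM literature, alongside e.g. the product):

```python
def solution_score(step_rewards):
    """Collapse per-step rewards into one solution-level score.

    Min-aggregation reflects the intuition that a single bad
    reasoning step invalidates the whole solution.
    """
    return min(step_rewards)

def best_of_n(candidates):
    """candidates: list of (answer, step_rewards) pairs for one problem.

    Returns the answer whose reasoning trace the PRM rates highest,
    i.e. PRM@N reranking over N sampled solutions.
    """
    return max(candidates, key=lambda c: solution_score(c[1]))[0]
```

This is why a PRM can beat majority voting: two traces may reach the same final answer, but the PRM prefers the one whose every intermediate step looks sound.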

Good For

  • Evaluating mathematical reasoning: Ideal for assessing the correctness and quality of intermediate steps in mathematical solutions.
  • Improving LLM performance on math tasks: Can be integrated into reinforcement learning from human feedback (RLHF) pipelines to enhance the mathematical capabilities of generative models.
  • Research in process-supervised reward modeling: Provides a strong baseline for further research into training and applying PRMs for complex reasoning.
