RLHFlow/Llama3.1-8B-ORM-Mistral-Data
RLHFlow/Llama3.1-8B-ORM-Mistral-Data is an 8-billion-parameter outcome-supervised reward model (ORM) derived from Meta's Llama-3.1-8B-Instruct, with a 32,768-token context window. It is trained on Mistral-generated data from the RLHFlow project to evaluate and improve the performance of language models on mathematical reasoning tasks. The model excels at scoring mathematical problem-solving outputs, and it significantly boosts accuracy on benchmarks like GSM8K and MATH when used for re-ranking or selection.
Overview
RLHFlow/Llama3.1-8B-ORM-Mistral-Data is an 8-billion-parameter outcome-supervised reward model (ORM) built upon Meta's Llama-3.1-8B-Instruct. It is fine-tuned for one epoch on the RLHFlow/Mistral-ORM-Data dataset, which consists of Mistral-generated mathematical problem-solving data. The model's primary function is to act as a robust evaluator for mathematical reasoning, capable of distinguishing correct solutions from incorrect ones and improving the overall performance of generative models through re-ranking or selection.
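In practice, a reward model of this kind is queried by feeding a question together with a candidate solution through the model and reading a correctness score off its output distribution. The sketch below is a minimal, unverified example along those lines: the "+"/"-" judgment-token convention and the chat formatting are assumptions borrowed from related RLHFlow reward models, so consult the RLHFlow repository for the exact prompt format.

```python
# Minimal scoring sketch. Assumes the RLHFlow convention of judging a
# solution by the relative probability of a "+" versus a "-" token at
# the final position; verify the prompt format against the RLHFlow repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLHFlow/Llama3.1-8B-ORM-Mistral-Data"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# Token ids for the "+" / "-" judgment tokens (assumed convention).
plus_id = tokenizer.encode("+", add_special_tokens=False)[-1]
minus_id = tokenizer.encode("-", add_special_tokens=False)[-1]

def score_solution(question: str, solution: str) -> float:
    """Return the model's estimated probability that `solution` is correct."""
    messages = [{"role": "user", "content": f"{question} {solution}"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    # Normalize over the two judgment tokens only.
    probs = torch.softmax(logits[[plus_id, minus_id]], dim=-1)
    return probs[0].item()
```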
Key Capabilities
- Mathematical Reasoning Evaluation: Specialized in assessing the correctness and quality of solutions to mathematical problems.
- Performance Enhancement: Delivers significant gains on benchmarks like GSM8K and MATH when used to re-rank or select outputs from generative models (see the re-ranking sketch after this list).
- Out-of-Distribution Robustness: Remains effective when scoring outputs from generators that were not part of its training data, such as Deepseek-7B (out-of-distribution evaluation).
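Re-ranking with the ORM amounts to best-of-n selection: sample several candidate solutions from a generator and keep the one the reward model scores highest. A minimal sketch, assuming a scoring function like the `score_solution` helper sketched above:

```python
from typing import Callable, List

def best_of_n(
    question: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate solution with the highest reward-model score."""
    return max(candidates, key=lambda c: score_fn(question, c))

# Hypothetical usage: sample n solutions from any generator, then select.
# candidates = [generate(question) for _ in range(n)]
# best = best_of_n(question, candidates, score_solution)
```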
Performance Highlights
The model significantly boosts the performance of generative models on mathematical tasks:

| Generator | GSM8K Pass@1 | GSM8K w/ Mistral-ORM@1024 | MATH Pass@1 | MATH w/ Mistral-ORM@1024 |
|---|---|---|---|---|
| Mistral-7B | 77.9 | 90.1 | 28.4 | 43.6 |
| Deepseek-7B (OOD) | 83.9 | 90.3 | 38.4 | 54.9 |
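Here "Mistral-ORM@1024" denotes best-of-1024 selection: sample n = 1024 candidate solutions from the generator and keep the one this ORM scores highest. Formally, this is the standard best-of-n formulation (stated here for clarity, not quoted from the model card):

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{1 \le i \le n} \; r_\theta(x, y_i),
\qquad y_1, \dots, y_n \sim \pi_{\text{gen}}(\cdot \mid x)
```

where $r_\theta$ is this ORM, $\pi_{\text{gen}}$ is the generator (e.g., Mistral-7B), and $n = 1024$.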
Good For
- Automated Evaluation: Scoring and ranking generated solutions for complex mathematical problems.
- Improving LLM Math Performance: Integrating into pipelines to select higher-quality mathematical outputs from other LLMs.
- Research in Reward Modeling: Exploring outcome-supervised reward modeling techniques for specialized tasks.