Overview
RLHFlow/Qwen2.5-7B-PPO-Zero is a 7.6-billion-parameter model developed by RLHFlow, fine-tuned from Qwen2.5-Math-7B (base). It is trained with a rule-based reward in an RLHF-style pipeline, optimized with Proximal Policy Optimization (PPO), to enhance its mathematical reasoning capabilities. It is part of a series of models, including iterative DPO and RAFT variants, all aimed at improving performance on complex mathematical tasks.
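Below is a minimal inference sketch using the standard Hugging Face transformers API. The model ID comes from this card; the prompt, dtype, and sampling parameters are illustrative assumptions, not recommended settings.

```python
# Minimal inference sketch; prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Qwen2.5-7B-PPO-Zero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed precision; adjust to your hardware
    device_map="auto",
)

prompt = "Solve: If 3x + 7 = 22, what is x? Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```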
Key Capabilities
- Enhanced Mathematical Reasoning: Achieves significant improvements over its base model on five widely adopted mathematical benchmarks: AIME 2024, MATH 500, AMC, Minerva Math, and OlympiadBench.
- PPO Fine-tuning: Utilizes Proximal Policy Optimization (PPO) for alignment, building on the success of models like DeepSeek-R1-Zero (a sketch of the PPO objective follows this list).
- Long Context: Supports a context window of 131,072 tokens, enabling processing of lengthy mathematical problems or complex instructions.
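For reference, here is a minimal sketch of the standard clipped PPO surrogate objective that methods in this family optimize. This is the textbook PPO-clip loss, not RLHFlow's training code; the function and argument names are hypothetical.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (hypothetical helper, not RLHFlow's code).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and behavior policies; advantages: estimated advantages.
    """
    # Probability ratio between the current and behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum; negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

The clipping keeps each policy update close to the behavior policy, which stabilizes RL fine-tuning of large language models.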
Performance Highlights
Across the five mathematical benchmarks, RLHFlow/Qwen2.5-7B-PPO-Zero achieves an average score of 51.8, a +21.6 improvement over Qwen2.5-Math-7B (base). Notably, it scores 43.3 (+26.6) on AIME 2024 and 62.5 (+10.0) on AMC, outperforming several baselines, including Qwen2.5-Math-7B-Instruct and Llama-3.1-70B-Instruct, in specific math categories.
Good for
- Mathematical Problem Solving: Ideal for applications requiring high accuracy in solving advanced math problems.
- Research and Development: Provides a strong baseline for further research into RLHF methods for mathematical reasoning.
- Educational Tools: Can be integrated into systems designed to assist with or generate solutions for complex mathematical challenges.