RLHFlow/Qwen2.5-7B-PPO-Zero
Text generation · Concurrency cost: 1 · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: Feb 13, 2025 · Architecture: Transformer

RLHFlow/Qwen2.5-7B-PPO-Zero is a 7.6-billion-parameter language model developed by RLHFlow, fine-tuned from Qwen2.5-MATH-7B-base with a rule-based reinforcement learning recipe using Proximal Policy Optimization (PPO). The model targets mathematical reasoning and problem-solving, and shows substantial gains on benchmarks such as AIME 2024, MATH 500, and OlympiadBench. It is intended for applications that need strong mathematical capability and robust reasoning over a 131,072-token context length.
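A minimal inference sketch using the Hugging Face `transformers` library is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the repo id `RLHFlow/Qwen2.5-7B-PPO-Zero` and that sufficient GPU memory is available; neither detail is confirmed by this page, and the prompt format is illustrative only.

```python
MODEL_ID = "RLHFlow/Qwen2.5-7B-PPO-Zero"  # assumed Hub repo id

def solve(question: str, max_new_tokens: int = 512) -> str:
    """Generate a step-by-step solution for a math question.

    Heavy dependencies are imported lazily so the sketch can be
    inspected without torch/transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the completion is returned.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# Example usage (requires a GPU and a network connection):
# print(solve("Compute the sum of the first 100 positive integers."))
```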
