ExGRPO-Qwen2.5-Math-7B-Zero: Experience-Driven Mathematical Reasoning
The ExGRPO-Qwen2.5-Math-7B-Zero model, developed by rzzhan, is a 7.6-billion-parameter language model built on the Qwen2.5-Math-7B architecture. It implements the ExGRPO (Experiential Group Relative Policy Optimization) framework, which addresses a core inefficiency of standard on-policy optimization in reinforcement learning from verifiable rewards (RLVR) for reasoning tasks: rollout experiences are normally discarded after a single update.
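At each training step, ExGRPO mixes fresh on-policy rollouts with replayed past experiences instead of throwing every rollout away after one gradient update. The snippet below is a minimal, self-contained sketch of that batch-mixing idea only; the function name, the dict-based rollout format, and the 50/50 mix ratio are illustrative assumptions, not the released training code.

```python
import random

# Toy sketch of ExGRPO-style batch mixing: each update combines fresh
# on-policy rollouts with experiences replayed from earlier steps.
# All names and the default mix ratio are illustrative assumptions.

def build_update_batch(fresh_rollouts, replay_buffer, mix_ratio=0.5):
    """Compose a training batch from fresh and replayed rollouts."""
    n_replay = int(len(fresh_rollouts) * mix_ratio)
    replayed = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    return fresh_rollouts + replayed

# Demo with placeholder dicts standing in for reasoning trajectories.
buffer = [{"id": f"old-{i}", "correct": True} for i in range(10)]
fresh = [{"id": f"new-{i}", "correct": False} for i in range(8)]
batch = build_update_batch(fresh, buffer)
print(f"{len(batch)} rollouts in the update batch")
```

The point of the mix is that the policy keeps exploring while still extracting additional learning signal from rollouts it has already paid for.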
Key Capabilities & Innovations
- Strategic Experience Management: ExGRPO introduces a system to identify, manage, and replay "high-value" experiences during RLVR training, using online proxy metrics such as rollout correctness and trajectory entropy to quantify experience quality (see the scoring sketch after this list).
- Enhanced Training Efficiency: By prioritizing and replaying valuable past explorations, the framework makes RLVR training more efficient and stable, mitigating failure modes such as training collapse in challenging setups.
- Broad Applicability: The ExGRPO framework generalizes across backbone models, including Llama-3.1 and other Qwen2.5 variants, for mathematical reasoning.
- Mathematical Reasoning Focus: This specific model is trained with ExGRPO directly from the Qwen2.5-Math-7B base (the "-Zero" setting, without a supervised fine-tuning stage) and specializes in mathematical problem solving.
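The two proxy metrics named in the first bullet can be made concrete with a small toy. In the sketch below, a question is scored highest when its rollout accuracy sits near 50% (neither trivial nor hopeless), and a trajectory's confidence is approximated by the mean negative log-probability of its sampled tokens (lower means more confident). The exact scoring and weighting ExGRPO uses are not reproduced here, so treat both functions as illustrative assumptions.

```python
import math

# Toy proxies for experience value: per-question rollout correctness
# and per-trajectory (log-prob-based) entropy. Illustrative only.

def question_value(n_correct: int, n_rollouts: int) -> float:
    """1.0 at 50% rollout accuracy, falling to 0.0 at 0% or 100%."""
    acc = n_correct / n_rollouts
    return 1.0 - 2.0 * abs(acc - 0.5)

def trajectory_entropy(token_probs: list[float]) -> float:
    """Mean negative log-probability of the sampled tokens; lower
    values mean the policy was more confident along the trajectory."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Two stored experiences with the same accuracy but different confidence:
# under a low-entropy preference, A would be replayed before B.
experiences = {
    "A": {"acc": (4, 8), "probs": [0.90, 0.85, 0.95]},
    "B": {"acc": (4, 8), "probs": [0.40, 0.30, 0.50]},
}
for name, e in experiences.items():
    print(name, f"value={question_value(*e['acc']):.2f}",
          f"entropy={trajectory_entropy(e['probs']):.2f}")
```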
Good For
- Mathematical Reasoning Tasks: Excels at solving competition-level math problems across benchmarks such as AIME, AMC, MATH-500, Minerva Math, and OlympiadBench (a minimal inference example follows this list).
- Research in RLVR Optimization: Provides a concrete framework for exploring experience replay and experience-management techniques in reinforcement learning for language models.
- Developing Stable Reasoning Agents: Offers a recipe for improving the stability and efficiency of training models on complex, multi-step reasoning tasks.
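For completeness, here is a minimal inference sketch using Hugging Face Transformers, assuming the Hub repo id follows the model name and that transformers and accelerate are installed. The prompt wording and generation settings are assumptions; a "-Zero"-style model may respond best to a specific prompt template, so consult the model card's own examples where available.

```python
# Minimal inference sketch; repo id, prompt, and generation settings
# are assumptions, not settings taken from the official model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Solve step by step: what is the sum of the first 50 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```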