jaygala24/Qwen2.5-3B-GRPO-math-reasoning
The jaygala24/Qwen2.5-3B-GRPO-math-reasoning model is a 3 billion parameter Qwen2.5-based language model fine-tuned by jaygala24. It was trained with Group Relative Policy Optimization (GRPO) without a KL penalty and is optimized for mathematical reasoning tasks. It reports 84.45% pass@1 on GSM8K (test) and 64.48% pass@1 on MATH-500, which makes it suitable for applications that need step-by-step mathematical problem-solving.
Model Overview
This model, jaygala24/Qwen2.5-3B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen2.5-3B base model. What sets it apart is its training method: Group Relative Policy Optimization (GRPO) without a KL penalty, a reinforcement learning technique used here to improve mathematical reasoning.
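To make the training method above more concrete, here is a minimal sketch of the two ideas usually associated with GRPO: rewards are normalized within the group of completions sampled for the same prompt, and the clipped policy-gradient term is optionally reduced by a KL penalty. The function names, the clipping range of 0.2, and the standard-deviation normalization are illustrative assumptions, not details taken from this model's training code or from PipelineRL. The only detail grounded in the card is that the KL coefficient is 0.0, so the penalty term vanishes.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the others sampled for the same prompt.

    Assumption: mean/std normalization, as in the common GRPO formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_objective(logp_new, logp_old, advantage,
                         clip_eps=0.2, kl_coef=0.0, kl_estimate=0.0):
    """Clipped surrogate for one token, minus an optional KL penalty.

    With kl_coef=0.0 (the setting this card reports) the KL term has no effect.
    clip_eps=0.2 is a hypothetical value chosen for illustration.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return surrogate - kl_coef * kl_estimate


# Four sampled answers to one problem: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, so no separate value model is needed to produce a baseline.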
Key Capabilities & Training
- Mathematical Reasoning: Specifically optimized for solving mathematical problems, as evidenced by its strong performance on relevant benchmarks.
- GRPO Fine-tuning: Policy optimization uses GRPO with a KL coefficient of 0.0 (no KL penalty), trained with PipelineRL.
- Dataset Focus: Trained on a combination of the gsm8k_train and math_train datasets, ensuring exposure to diverse mathematical problems.
- Evaluation: Reported pass@1 scores on mathematical reasoning benchmarks:
  - GSM8K (test): 84.45% pass@1
  - MATH-500: 64.48% pass@1
  - Overall: 78.96% pass@1 across 1819 problems.
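The overall figure is consistent with a problem-weighted average of the two benchmark scores. The short check below assumes the standard split sizes, 1319 problems for the GSM8K test set and 500 for MATH-500, which add up to the 1819 problems reported above. The split sizes are not stated in the card itself.

```python
# (number of problems, pass@1 in percent) per benchmark.
# Problem counts are assumed from the standard GSM8K test and MATH-500 splits.
benchmarks = {
    "GSM8K (test)": (1319, 84.45),
    "MATH-500": (500, 64.48),
}

total_problems = sum(n for n, _ in benchmarks.values())
overall_pass_at_1 = sum(n * acc for n, acc in benchmarks.values()) / total_problems

print(total_problems, round(overall_pass_at_1, 2))  # 1819 78.96
```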
When to Use This Model
This model is particularly well-suited for applications requiring:
- Accurate Mathematical Problem Solving: Ideal for tasks that demand step-by-step reasoning to arrive at a numerical or logical mathematical answer.
- Educational Tools: Can be integrated into systems for generating solutions or explanations for math problems.
- Research in RL for Reasoning: Provides a strong baseline for exploring the impact of GRPO on complex reasoning tasks.
Its specialized training makes it a strong candidate for focused mathematical applications at the 3B-parameter scale.
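For the use cases above, a minimal loading sketch with the Hugging Face transformers library might look like the following. The prompt wording in `build_prompt`, the `solve` helper, and the generation settings are hypothetical choices for illustration. The card does not specify a prompt format, so the format used in training may differ and results may vary with it.

```python
MODEL_ID = "jaygala24/Qwen2.5-3B-GRPO-math-reasoning"


def build_prompt(problem: str) -> str:
    """Wrap a math problem in a plain instruction (hypothetical prompt format)."""
    return (
        "Solve the following math problem step by step, "
        "then state the final answer.\n\n"
        f"Problem: {problem}\n\nSolution:"
    )


def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a solution with greedy decoding.

    transformers (and a backend such as PyTorch) is imported lazily so the
    helper above can be used without the library installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(solve("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```

Greedy decoding is used here because pass@1 is typically measured on a single attempt per problem. Sampling settings are a separate choice the card does not document.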