jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning
jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning is a 0.5-billion-parameter Qwen2.5 model fine-tuned by jaygala24 using Group Relative Policy Optimization (GRPO) without a KL penalty. The model is optimized for mathematical reasoning: it was trained on GSM8K and MATH problems and performs strongly on math reasoning benchmarks, making it suitable for applications that require step-by-step numerical problem solving.
Overview
This model, jaygala24/Qwen2.5-0.5B-GRPO-math-reasoning, is a specialized fine-tune of the Qwen2.5-0.5B base model. Developed by jaygala24, its core differentiator is the application of Group Relative Policy Optimization (GRPO) without a KL penalty for enhanced mathematical reasoning. The training utilized the PipelineRL framework and focused on datasets such as gsm8k_train and math_train.
Key Capabilities
- Mathematical Reasoning: Specifically fine-tuned to excel at solving mathematical problems, trained on GSM8K and MATH problems and evaluated on the GSM8K test set and MATH-500.
- GRPO Optimization: Leverages a unique reinforcement learning approach (GRPO with group mean reward as baseline) to improve policy performance in reasoning tasks.
- Compact Size: At 0.5 billion parameters, it offers a relatively small footprint while delivering competitive performance in its specialized domain.
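The GRPO approach named above replaces a learned value-function baseline with the mean reward of a group of completions sampled for the same prompt. A minimal sketch of that advantage computation (the function name is illustrative and not from the PipelineRL codebase; some GRPO implementations additionally normalize by the group's standard deviation):

```python
from statistics import mean

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages for one prompt's group of sampled
    completions: each completion's reward minus the group mean reward.
    Sketch only; the actual training code may also divide by the
    group standard deviation."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Example: 4 completions sampled for one math problem, reward 1.0 if the
# final answer is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# → [0.5, -0.5, -0.5, 0.5]: only above-average completions are reinforced.
```

Because the baseline is computed per group, the advantages always sum to zero within a group, which keeps the policy gradient centered without training a separate critic.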
Evaluation Highlights
The model achieves the following pass@k scores on mathematical benchmarks:
- GSM8K (test): 51.77% pass@1, 89.76% pass@32
- MATH-500: 31.18% pass@1, 73.00% pass@32
- Overall: 46.11% pass@1, 85.16% pass@32
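pass@k is the probability that at least one of k sampled completions solves a problem. The standard unbiased estimator computes this from n samples of which c are correct; a sketch, assuming the scores above were obtained this way (the sample counts in the example are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n completions were sampled and c of them were correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 16 of 64 samples correct, pass@1 reduces to raw accuracy:
pass_at_k(64, 16, 1)   # → 0.25
# pass@32 is much higher, since only one of 32 draws needs to succeed.
```

Averaging this estimator over all benchmark problems yields the aggregate percentages reported above.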
Good for
- Mathematical Problem Solving: Ideal for applications requiring accurate step-by-step mathematical reasoning.
- Educational Tools: Can be integrated into systems designed to assist with or evaluate math homework and exercises.
- Research in RL for Reasoning: Provides a practical example of GRPO's application in fine-tuning LLMs for specific cognitive tasks.