Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo
Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo is a 0.5-billion-parameter Qwen2.5-Instruct model fine-tuned with GRPO on a subset of the OpenAI GSM8K dataset. It is optimized for mathematical reasoning, with the aim of improving accuracy on numerical word problems. It generates step-by-step reasoning followed by a final answer, which suits applications that need structured mathematical output.
Model Overview
This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo, is a 0.5 billion parameter variant of the Qwen/Qwen2.5-0.5B-Instruct base model. It has undergone a small, single-GPU reinforcement learning post-training run using the GRPO method.
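For readers unfamiliar with GRPO (Group Relative Policy Optimization), the central idea is to sample several completions per prompt and score each one relative to the others in its group. The sketch below shows only that advantage-normalization step as it is commonly described. The function name, the example rewards, the epsilon value, and the use of the population standard deviation are illustrative assumptions, not details taken from this model's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled completions of one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.2])
```

Completions that score above their group's mean receive a positive advantage and are reinforced; those below the mean receive a negative one.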
Key Characteristics
- Base Model: Qwen2.5-0.5B-Instruct.
- Fine-tuning Method: GRPO (Group Relative Policy Optimization).
- Training Data: A subset of the openai/gsm8k dataset, specifically configured for mathematical word problems.
- Optimization Goal: Enhanced performance in mathematical reasoning, with a reward system designed to prioritize correct final numeric answers and parseable outputs.
- Prompt Format: Expects step-by-step reasoning leading to a final answer, typically marked after ####.
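The answer marker and the reward priorities listed above could be handled along the following lines. This is a minimal sketch, not the reward function used in training: the regular expression, the function names, and the bonus values (`correct_bonus`, `format_bonus`) are hypothetical and chosen only to illustrate "parseable output earns a little, a correct final number earns more".

```python
import re

# Matches a number (optional sign, thousands separators, decimals) after "####".
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")

def extract_final_answer(text):
    """Return the number following the last '####' marker as a float, or None."""
    matches = _FINAL_ANSWER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def reward(completion, reference_value, correct_bonus=1.0, format_bonus=0.1):
    """Hypothetical reward: small credit for a parseable answer, more if it is correct."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0  # no '####' marker followed by a number
    score = format_bonus
    if abs(predicted - reference_value) < 1e-6:
        score += correct_bonus
    return score

# Illustrative completion, not real model output.
example_score = reward("3 + 4 = 7\n#### 7", reference_value=7.0)
```

Taking the last `####` match is a design choice for this sketch: it tolerates completions that repeat the marker while reasoning, at the cost of trusting whatever number appears last.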
Performance Insights
This model is a controlled experiment rather than a benchmark, with specific metrics recorded:
- Evaluation Accuracy: 0.12
- Evaluation Reward: 0.208
- Training Reward: 0.06875
These metrics reflect its performance on the GSM8K subset it was trained and evaluated on. The model's small size and focused training make it suitable for exploring the effects of GRPO on mathematical reasoning tasks within a constrained environment.
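The card does not state how these figures were computed. As a rough sketch, assuming accuracy is the fraction of examples whose final number matches the reference and reward is the per-example mean, the aggregation might look like this. The function name and the sample values are hypothetical and are not this model's actual outputs.

```python
def summarize(rewards, correct_flags):
    """Return (accuracy, mean_reward) over an evaluation set."""
    if len(rewards) != len(correct_flags) or not rewards:
        raise ValueError("need two non-empty lists of equal length")
    accuracy = sum(correct_flags) / len(correct_flags)  # True counts as 1
    mean_reward = sum(rewards) / len(rewards)
    return accuracy, mean_reward

# Hypothetical per-example results for four evaluation problems.
accuracy, mean_reward = summarize(
    rewards=[1.1, 0.1, 0.0, 0.1],
    correct_flags=[True, False, False, False],
)
```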