Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:0.5BQuant:BF16Ctx Length:32kPublished:May 22, 2026Architecture:Transformer Warm

Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo is a 0.5 billion parameter Qwen2.5-Instruct model fine-tuned using the GRPPO method on a subset of the OpenAI GSM8K dataset. This model is specifically optimized for mathematical reasoning tasks, aiming to improve accuracy in numerical problem-solving. It focuses on generating step-by-step reasoning with a final answer, making it suitable for applications requiring structured mathematical output.

Loading preview...

Model Overview

This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo, is a 0.5 billion parameter variant of the Qwen/Qwen2.5-0.5B-Instruct base model. It has undergone a small, single-GPU reinforcement learning post-training run using the GRPO method.

Key Characteristics

  • Base Model: Qwen2.5-0.5B-Instruct.
  • Fine-tuning Method: GRPPO (Generalized Reward Policy Optimization).
  • Training Data: A subset of the openai/gsm8k dataset, specifically configured for mathematical word problems.
  • Optimization Goal: Enhanced performance in mathematical reasoning, with a reward system designed to prioritize correct final numeric answers and parseable outputs.
  • Prompt Format: Expects step-by-step reasoning leading to a final answer, typically marked after ####.

Performance Insights

This model is a controlled experiment rather than a benchmark, with specific metrics recorded:

  • Evaluation Accuracy: 0.12
  • Evaluation Reward: 0.208
  • Training Reward: 0.06875

These metrics reflect its performance on the GSM8K subset it was trained and evaluated on. The model's small size and focused training make it suitable for exploring the effects of GRPPO on mathematical reasoning tasks within a constrained environment.