Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:0.5BQuant:BF16Ctx Length:32kPublished:May 22, 2026Architecture:Transformer Warm

Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo is a 0.5 billion parameter Qwen2.5-Instruct model fine-tuned using Proximal Policy Optimization (PPO). It was specifically trained on a subset of the OpenAI GSM8K dataset, focusing on mathematical reasoning tasks. This model is designed for single-GPU deployment and aims to improve performance on arithmetic and problem-solving challenges, particularly those requiring step-by-step reasoning.

Loading preview...

Model Overview

This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo, is a 0.5 billion parameter language model based on the Qwen/Qwen2.5-0.5B-Instruct architecture. It has been fine-tuned using the Proximal Policy Optimization (PPO) method on a subset of the OpenAI GSM8K dataset, which focuses on grade school mathematical word problems.

Key Characteristics

  • Base Model: Qwen2.5-0.5B-Instruct
  • Fine-tuning Method: PPO
  • Training Data: 400 samples from the openai/gsm8k dataset, with 100 samples for evaluation.
  • Reward System: Rewards were structured to encourage correct final numeric answers (+1.0) and parseable final answers (+0.1).
  • Prompt Format: Utilizes a step-by-step reasoning format, expecting the final answer after ####.

Intended Use

This model is a small, controlled experiment demonstrating the application of PPO for improving performance on specific mathematical reasoning tasks. It is suitable for:

  • Exploring the effects of PPO fine-tuning on small language models.
  • Developing and testing approaches for arithmetic and problem-solving within a constrained environment.
  • Use cases where a compact model with some mathematical reasoning capabilities is required, though its current evaluation accuracy is 0.14.