Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo
Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo is a 0.5 billion parameter Qwen2.5-Instruct model fine-tuned using Proximal Policy Optimization (PPO). It was specifically trained on a subset of the OpenAI GSM8K dataset, focusing on mathematical reasoning tasks. This model is designed for single-GPU deployment and aims to improve performance on arithmetic and problem-solving challenges, particularly those requiring step-by-step reasoning.
Loading preview...
Model Overview
This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo, is a 0.5 billion parameter language model based on the Qwen/Qwen2.5-0.5B-Instruct architecture. It has been fine-tuned using the Proximal Policy Optimization (PPO) method on a subset of the OpenAI GSM8K dataset, which focuses on grade school mathematical word problems.
Key Characteristics
- Base Model: Qwen2.5-0.5B-Instruct
- Fine-tuning Method: PPO
- Training Data: 400 samples from the
openai/gsm8kdataset, with 100 samples for evaluation. - Reward System: Rewards were structured to encourage correct final numeric answers (+1.0) and parseable final answers (+0.1).
- Prompt Format: Utilizes a step-by-step reasoning format, expecting the final answer after
####.
Intended Use
This model is a small, controlled experiment demonstrating the application of PPO for improving performance on specific mathematical reasoning tasks. It is suitable for:
- Exploring the effects of PPO fine-tuning on small language models.
- Developing and testing approaches for arithmetic and problem-solving within a constrained environment.
- Use cases where a compact model with some mathematical reasoning capabilities is required, though its current evaluation accuracy is 0.14.