Name: Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Pradheep1647

Model Overview

This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-grpo, is a 0.5-billion-parameter fine-tune of the Qwen/Qwen2.5-0.5B-Instruct base model. It was produced by a small, single-GPU reinforcement learning post-training run using the GRPO method.

Key Characteristics

Base Model: Qwen2.5-0.5B-Instruct.
Fine-tuning Method: GRPO (Group Relative Policy Optimization).
Training Data: A subset of the openai/gsm8k dataset, specifically configured for mathematical word problems.
Optimization Goal: Enhanced performance in mathematical reasoning, with a reward system designed to prioritize correct final numeric answers and parseable outputs.
Prompt Format: Expects step-by-step reasoning leading to a final answer, typically marked after ####.
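The prompt format and reward goal above can be illustrated with a minimal sketch. It assumes the final answer follows the last #### marker, as in GSM8K reference solutions. The function names, the regular expression, and the reward values (1.0 for a correct answer, 0.1 for a parseable but wrong one, 0.0 otherwise) are hypothetical choices for illustration, not the reward actually used to train this model.

```python
import re
from typing import Optional

# Hypothetical pattern: a number after "####", allowing an optional "$",
# a leading minus sign, thousands separators, and a decimal part.
_FINAL_ANSWER = re.compile(r"####\s*\$?(-?[\d,]*\.?\d+)")


def extract_final_answer(completion: str) -> Optional[float]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = _FINAL_ANSWER.findall(completion)
    if not matches:
        return None
    try:
        return float(matches[-1].replace(",", ""))
    except ValueError:
        return None


def reward(completion: str, gold: float, format_bonus: float = 0.1) -> float:
    """Illustrative reward: full credit for a correct final answer,
    a small bonus (assumed value) for a parseable but wrong answer."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0  # output could not be parsed
    if abs(predicted - gold) < 1e-6:
        return 1.0  # correct final numeric answer
    return format_bonus  # parseable, but incorrect


if __name__ == "__main__":
    sample = "She sells 16 - 3 - 4 = 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18"
    print(extract_final_answer(sample))
    print(reward(sample, gold=18.0))
```

A reward shaped this way gives partial credit for well-formed outputs, so an average reward can exceed plain accuracy; whether that applies to the metrics reported for this model is not stated in the source.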

Performance Insights

This model is a controlled experiment rather than a benchmark submission. The following metrics were recorded:

Evaluation Accuracy: 0.12
Evaluation Reward: 0.208
Training Reward: 0.06875

These metrics reflect its performance on the GSM8K subset it was trained and evaluated on. The model's small size and focused training make it suitable for exploring the effects of GRPO on mathematical reasoning tasks within a constrained environment.
