Name: Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Pradheep1647

Model Overview

This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-ppo, is a 0.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-0.5B-Instruct. Fine-tuning used Proximal Policy Optimization (PPO) on a subset of the OpenAI GSM8K dataset, which consists of grade-school math word problems.

Key Characteristics

Base Model: Qwen2.5-0.5B-Instruct
Fine-tuning Method: PPO
Training Data: 400 samples from the openai/gsm8k dataset, with 100 samples for evaluation.
Reward System: +1.0 for a correct final numeric answer and +0.1 for a parseable final answer.
Prompt Format: Step-by-step reasoning, with the final answer expected after the #### marker.
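
The reward scheme and answer format listed above can be sketched in a few lines of Python. This is a minimal illustration, not the author's training code: the function names, the regular expression, the numeric tolerance, and the choice to add the two rewards together (giving 1.1 for a correct answer) are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumed pattern: the last "#### <number>" in the completion is the final answer.
# Allows an optional "$", a leading minus sign, thousands separators, and decimals.
FINAL_ANSWER_RE = re.compile(r"####\s*\$?(-?[\d,]*\.?\d+)")


def parse_final_answer(text: str) -> Optional[float]:
    """Return the number following the last '####' marker, or None if absent."""
    matches = FINAL_ANSWER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reward(completion: str, gold: float, tol: float = 1e-6) -> float:
    """Hypothetical reward: +0.1 if an answer parses, +1.0 more if it is correct.

    Whether the original training summed the two terms is not stated in the
    model card; summing is an assumption of this sketch.
    """
    pred = parse_final_answer(completion)
    if pred is None:
        return 0.0          # nothing parseable after '####'
    score = 0.1             # parseable final answer
    if abs(pred - gold) <= tol:
        score += 1.0        # correct final numeric answer
    return score


if __name__ == "__main__":
    sample = "48 / 2 = 24 clips in May. 48 + 24 = 72 clips in total.\n#### 72"
    print(reward(sample, gold=72.0))
```

Taking the last `####` match means that intermediate markers in the reasoning do not count as the final answer. A stricter parser could instead reject completions with more than one marker.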

Intended Use

This model is a small, controlled experiment demonstrating the application of PPO for improving performance on specific mathematical reasoning tasks. It is suitable for:

Exploring the effects of PPO fine-tuning on small language models.
Developing and testing approaches for arithmetic and problem-solving within a constrained environment.
Use cases that require a compact model with some mathematical reasoning capability. Note that its reported evaluation accuracy is only 0.14 (14%).
