Name: Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Pradheep1647

Overview

This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full, is a 0.5 billion parameter instruction-tuned causal language model built upon the Qwen/Qwen2.5-0.5B-Instruct base. It has been fine-tuned using a small-scale Reinforcement Learning (RL) post-training run, specifically employing the DPPO method on a subset of the openai/gsm8k dataset.

Key Characteristics

Base Model: Qwen/Qwen2.5-0.5B-Instruct
Training Method: DPPO (Deep Proximal Policy Optimization) on a GSM8K subset.
Dataset: openai/gsm8k (400 training samples, 100 evaluation samples).
Prompt Format: Expects step-by-step reasoning, with the final answer indicated after ####.
Reward System: Rewards correct numeric answers (+1.0) and parseable final answers (+0.1).

Performance & Purpose

This model is presented as a small, controlled experiment rather than a benchmark, with an evaluation accuracy of 0.11 and an eval reward of 0.199. Its primary focus is on demonstrating the effects of RL post-training for mathematical reasoning on a small scale. It is particularly suited for exploring and experimenting with instruction-following for arithmetic problem-solving within a constrained environment.

Usage

Developers can easily load and use the model with the Hugging Face transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

Overview

Overview

Key Characteristics

Performance & Purpose

Usage

Full Model Card (README)