Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:0.5BQuant:BF16Ctx Length:32kPublished:May 22, 2026Architecture:Transformer Warm

Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full is a 0.5 billion parameter instruction-tuned causal language model based on Qwen/Qwen2.5-0.5B-Instruct. This model has undergone a small single-GPU RL post-training run using a subset of the OpenAI GSM8K dataset, specifically optimized for mathematical reasoning tasks. It is designed to generate step-by-step reasoning with a final answer, making it suitable for arithmetic and problem-solving applications.

Loading preview...

Overview

This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full, is a 0.5 billion parameter instruction-tuned causal language model built upon the Qwen/Qwen2.5-0.5B-Instruct base. It has been fine-tuned using a small-scale Reinforcement Learning (RL) post-training run, specifically employing the DPPO method on a subset of the openai/gsm8k dataset.

Key Characteristics

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Training Method: DPPO (Deep Proximal Policy Optimization) on a GSM8K subset.
  • Dataset: openai/gsm8k (400 training samples, 100 evaluation samples).
  • Prompt Format: Expects step-by-step reasoning, with the final answer indicated after ####.
  • Reward System: Rewards correct numeric answers (+1.0) and parseable final answers (+0.1).

Performance & Purpose

This model is presented as a small, controlled experiment rather than a benchmark, with an evaluation accuracy of 0.11 and an eval reward of 0.199. Its primary focus is on demonstrating the effects of RL post-training for mathematical reasoning on a small scale. It is particularly suited for exploring and experimenting with instruction-following for arithmetic problem-solving within a constrained environment.

Usage

Developers can easily load and use the model with the Hugging Face transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)