Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full
Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full is a 0.5 billion parameter instruction-tuned causal language model based on Qwen/Qwen2.5-0.5B-Instruct. This model has undergone a small single-GPU RL post-training run using a subset of the OpenAI GSM8K dataset, specifically optimized for mathematical reasoning tasks. It is designed to generate step-by-step reasoning with a final answer, making it suitable for arithmetic and problem-solving applications.
Loading preview...
Overview
This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full, is a 0.5 billion parameter instruction-tuned causal language model built upon the Qwen/Qwen2.5-0.5B-Instruct base. It has been fine-tuned using a small-scale Reinforcement Learning (RL) post-training run, specifically employing the DPPO method on a subset of the openai/gsm8k dataset.
Key Characteristics
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
- Training Method: DPPO (Deep Proximal Policy Optimization) on a GSM8K subset.
- Dataset:
openai/gsm8k(400 training samples, 100 evaluation samples). - Prompt Format: Expects step-by-step reasoning, with the final answer indicated after
####. - Reward System: Rewards correct numeric answers (+1.0) and parseable final answers (+0.1).
Performance & Purpose
This model is presented as a small, controlled experiment rather than a benchmark, with an evaluation accuracy of 0.11 and an eval reward of 0.199. Its primary focus is on demonstrating the effects of RL post-training for mathematical reasoning on a small scale. It is particularly suited for exploring and experimenting with instruction-following for arithmetic problem-solving within a constrained environment.
Usage
Developers can easily load and use the model with the Hugging Face transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-full"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)