Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-topk
Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-topk is a 0.5 billion parameter Qwen2.5-Instruct model fine-tuned using DPPO-topk on a subset of the OpenAI GSM8K dataset. This model is an experimental run focused on improving mathematical reasoning for small models, specifically targeting arithmetic problem-solving. It aims to enhance the model's ability to produce correct numeric answers and parseable final answers in a step-by-step reasoning format.
Loading preview...
Model Overview
This model, Pradheep1647/qwen2.5-0.5b-instruct-openai-gsm8k-dppo-topk, is a small-scale experimental fine-tune of the Qwen/Qwen2.5-0.5B-Instruct base model. It utilizes the DPPO-topk method for post-training on a subset of the openai/gsm8k dataset, which is designed for mathematical word problems.
Key Characteristics
- Base Model: Qwen2.5-0.5B-Instruct, a 0.5 billion parameter model.
- Fine-tuning Method: DPPO-topk (Direct Preference Optimization with top-k sampling).
- Training Data: A small subset of 400 samples from the
openai/gsm8kdataset, with 100 samples used for evaluation. - Prompt Format: Emphasizes step-by-step reasoning, with the final answer expected after
####. - Reward System: Rewards +1.0 for a correct final numeric answer and +0.1 for a parseable final answer, indicating a focus on structured output and accuracy in mathematical contexts.
Experimental Focus
This model represents a small, controlled experiment rather than a benchmark-setting release. Its primary purpose is to explore the effectiveness of DPPO-topk on a mathematical reasoning dataset for a compact model. While the reported eval_acc is 0.11, the experiment provides insights into the training process and reward shaping for such tasks. It is not intended for production use but as a demonstration of a specific fine-tuning approach for arithmetic problem-solving.