p1atdev/qwen2.5-0.5b-grpo-math-01
p1atdev/qwen2.5-0.5b-grpo-math-01 is a 0.5 billion parameter Qwen2.5-based causal language model fine-tuned by p1atdev using GRPO. This model is specifically optimized for solving simple arithmetic problems involving addition and multiplication, demonstrating a structured thought process before providing an answer. It excels at generating step-by-step reasoning for basic math formulas like 'A + B * C' or 'A * B + C'.
Model Overview
p1atdev/qwen2.5-0.5b-grpo-math-01 is a compact 0.5 billion parameter language model built upon the Qwen2.5 architecture. It has been fine-tuned using GRPO (Group Relative Policy Optimization), specifically targeting the domain of simple arithmetic problem-solving. The model's training focused on generating structured responses that include a thinking process (`<think></think>`) before delivering the final answer (`<answer></answer>`).
Key Capabilities
- Arithmetic Problem Solving: Specialized in calculating results for basic formulas of the form `A + B * C` or `A * B + C`.
- Structured Reasoning: Generates a detailed thought process within `<think>` tags, followed by the numerical answer in `<answer>` tags, adhering to a specific prompt format.
- Efficient Training: Trained on a single A100 80G GPU for approximately 1 hour over 140 steps, demonstrating efficient fine-tuning for a niche task.
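The structured output lends itself to programmatic consumption. Below is a minimal parsing sketch: the `<think>`/`<answer>` tag names come from the model card, while the helper function itself is illustrative, not part of the model's tooling.

```python
import re

def parse_structured_output(text: str):
    """Extract the reasoning and final answer from a model completion.

    Assumes the completion follows the <think>...</think><answer>...</answer>
    format this model was trained to produce; returns None for missing parts.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

completion = "<think>3 + 4 * 2 = 3 + 8 = 11</think><answer>11</answer>"
reasoning, answer = parse_structured_output(completion)
```

With the example completion above, `reasoning` holds the step-by-step arithmetic and `answer` holds `"11"`, which downstream code can validate or convert to a number.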
Training Details
The model was trained using a synthetic dataset of 100,000 simple arithmetic problems. Reward functions were implemented to encourage both correct output formatting and accurate numerical answers. The training utilized TRL's `GRPOTrainer` with bf16 precision.
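The two-part reward scheme described above could be sketched as follows. The actual reward functions are not reproduced in this card, so this is an illustrative reconstruction in the style accepted by TRL's `GRPOTrainer` (callables that return one score per completion); the function names and scoring values are assumptions.

```python
import re

# Completion must be exactly a <think> block followed by an <answer> block.
TAG_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions, **kwargs):
    """Score 1.0 when a completion matches the <think>/<answer> layout, else 0.0."""
    return [1.0 if TAG_PATTERN.match(c.strip()) else 0.0 for c in completions]

def accuracy_reward(completions, answers, **kwargs):
    """Score 1.0 when the extracted answer equals the reference answer, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, answers):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        rewards.append(1.0 if match and match.group(1).strip() == reference else 0.0)
    return rewards
```

In a TRL setup, both callables would typically be passed together (e.g. `reward_funcs=[format_reward, accuracy_reward]`) so the policy is rewarded for structure and correctness independently.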
Use Cases
This model is ideal for applications requiring accurate and explainable solutions to simple arithmetic problems, particularly where a structured reasoning output is beneficial. Its small size makes it suitable for deployment in resource-constrained environments or for integration into larger systems as a specialized math co-processor for basic calculations.