aryan-kolapkar/MathReasoner-Mini-1.5b
MathReasoner-Mini-1.5b: Specialized Mathematical Reasoning Model
MathReasoner-Mini-1.5b is a 1.5 billion parameter language model developed by aryan-kolapkar, specifically engineered for mathematical reasoning. Built on the Qwen2.5-Math-1.5B-base architecture, this model has undergone a rigorous three-stage training process (SFT, DPO, and GRPO) to enhance its ability to solve school-level math problems, particularly those found in the GSM8K dataset.
Key Capabilities & Performance
- High Mathematical Accuracy: Achieves approximately 83.7% Pass@1 zero-shot accuracy on the GSM8K benchmark, a significant improvement over the base Qwen2.5-Math-1.5B's 54%.
- Structured Output: Demonstrates 99% accuracy in generating structured outputs, enclosing reasoning within `<think>` tags and numerical answers within `<answer>` tags, which is crucial for automated evaluation and clarity.
- Reinforcement Learning Enhanced: Utilizes GRPO (Group Relative Policy Optimization) with a custom reward function focusing on format strictness and correctness, further refining its reasoning capabilities.
- Context Length: Supports a substantial context length of 32768 tokens.
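Because the model emits its reasoning and answer in fixed tags, downstream code can consume completions programmatically. A minimal parsing sketch (the `<think>`/`<answer>` tag names come from this card; the helper itself is illustrative and not part of the model's tooling):

```python
import re

def parse_structured_output(text: str):
    """Extract reasoning and final answer from a completion in the
    <think>...</think><answer>...</answer> format described above.
    Returns (reasoning, answer), or None if the format is violated."""
    match = re.search(
        r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", text, re.DOTALL
    )
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

# Hypothetical completion in the expected format
completion = "<think>18 - 3 - 4 = 11 eggs sold at $2 each.</think><answer>22</answer>"
parsed = parse_structured_output(completion)
```

A `None` return can be used to flag (and, during evaluation, penalize) completions that break the template.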
Training Methodology
The model's performance is a result of a multi-stage training approach:
- Supervised Fine-Tuning (SFT): Initial training on a curated GSM8K subset with self-verified generations.
- Direct Preference Optimization (DPO): Fine-tuning on ~1,000 preference pairs that favor correct over incorrect reasoning and prefer shorter correct Chain-of-Thought (CoT) samples.
- GRPO Reinforcement Learning: Further optimization using GRPO on the GSM8K train split, incorporating a custom reward for format and correctness.
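The custom GRPO reward described above combines format strictness with answer correctness. A sketch of what such a reward might look like; the actual reward shaping and weights used for this model are not published, so the equal weighting and exact template check here are illustrative assumptions:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion matches the strict <think>/<answer> template,
    else 0.0. (Assumed template check; the trained one is not published.)"""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the numeric answer inside <answer> tags equals the gold answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    try:
        return 1.0 if float(match.group(1).strip()) == float(gold_answer) else 0.0
    except ValueError:
        return 0.0

def grpo_reward(completion: str, gold_answer: str) -> float:
    """Combined reward: format adherence plus correctness (assumed equal weights)."""
    return format_reward(completion) + correctness_reward(completion, gold_answer)
```

In GRPO, this scalar reward is computed for each completion in a sampled group and advantages are taken relative to the group mean, which is what lets a format-plus-correctness signal sharpen both the template adherence and the final-answer accuracy reported above.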
Recommended Use Cases
- High School Level Math Problems: Ideal for tasks requiring step-by-step mathematical reasoning at a high school curriculum level.
- Structured Reasoning Output: Particularly effective when applications require clearly delineated reasoning processes and final answers.
Note: The model is designed primarily for mathematical tasks and performs best when questions are posed in English. It is not recommended for general-purpose tasks.