aryan-kolapkar/MathReasoner-Mini-1.5b

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Nov 20, 2025 · License: apache-2.0 · Architecture: Transformer

MathReasoner-Mini-1.5b by aryan-kolapkar is a 1.5 billion parameter reasoning model, built upon Qwen2.5-Math-1.5B-base and fine-tuned specifically for mathematical reasoning tasks. It excels at solving high school level math problems, achieving approximately 83.7% accuracy on the GSM8K benchmark. The model is optimized for structured outputs, with 99% accuracy in generating reasoning within <think> tags and answers within <answer> tags, and has a context length of 32768 tokens.


MathReasoner-Mini-1.5b: Specialized Mathematical Reasoning Model

MathReasoner-Mini-1.5b is a 1.5 billion parameter language model developed by aryan-kolapkar, specifically engineered for mathematical reasoning. Built on the Qwen2.5-Math-1.5B-base architecture, this model has undergone a rigorous three-stage training process (SFT, DPO, and GRPO) to enhance its ability to solve school-level math problems, particularly those found in the GSM8K dataset.

Key Capabilities & Performance

  • High Mathematical Accuracy: Achieves approximately 83.7% Pass@1 zero-shot accuracy on the GSM8K benchmark, a significant improvement over the base Qwen2.5-Math-1.5B's 54%.
  • Structured Output: Demonstrates 99% accuracy in generating structured outputs, enclosing reasoning within <think> tags and numerical answers within <answer> tags, which is crucial for automated evaluation and clarity; a usage sketch follows this list.
  • Reinforcement Learning Enhanced: Utilizes GRPO (Group Relative Policy Optimization) with a custom reward function focusing on format strictness and correctness, further refining its reasoning capabilities.
  • Context Length: Supports a substantial context length of 32768 tokens.
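
A minimal inference sketch using the Hugging Face transformers library. The plain-question prompt shown here is an assumption, since the card does not document a chat template; check the tokenizer for one before relying on this format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aryan-kolapkar/MathReasoner-Mini-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)
inputs = tokenizer(question, return_tensors="pt").to(model.device)

# Greedy decoding; the model is trained to emit <think>...</think>
# followed by <answer>...</answer>.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```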

Training Methodology

The model's performance is a result of a multi-stage training approach:

  1. Supervised Fine-Tuning (SFT): Initial training on a curated GSM8K subset with self-verified generations.
  2. Direct Preference Optimization (DPO): Fine-tuning with ~1,000 preference pairs, emphasizing correct vs. incorrect reasoning and shorter, correct CoT (Chain-of-Thought) samples.
  3. GRPO Reinforcement Learning: Further optimization using GRPO on the GSM8K train split, incorporating a custom reward for format and correctness (a sketch of such a reward follows this list).
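
The card describes the GRPO reward only as targeting format strictness and correctness; the weights, regular expressions, and helper below are illustrative assumptions, not the author's actual implementation:

```python
import re

def reward(completion: str, gold_answer: str) -> float:
    """Hypothetical GRPO reward combining format strictness and correctness.
    The 0.5 / 1.0 weights are assumptions, not the card's actual values."""
    r = 0.0
    # Format strictness: exactly one <think> block, then one <answer> block.
    if re.fullmatch(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
                    completion, re.DOTALL):
        r += 0.5
    # Correctness: the extracted answer matches the GSM8K gold answer.
    m = re.search(r"<answer>\s*([-\d,.]+)\s*</answer>", completion)
    if m and m.group(1).replace(",", "") == gold_answer.replace(",", ""):
        r += 1.0
    return r
```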

Recommended Use Cases

  • High School Level Math Problems: Ideal for tasks requiring step-by-step mathematical reasoning at a high school curriculum level.
  • Structured Reasoning Output: Particularly effective when applications require clearly delineated reasoning processes and final answers; see the parsing sketch after this list.
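
For application-side handling, a simple sketch that separates the chain of thought from the final answer. The tag format comes from the card; the helper itself is illustrative:

```python
import re

def split_reasoning_and_answer(completion: str) -> tuple[str, str]:
    # Log the <think> reasoning internally; surface only the <answer> to users.
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else completion.strip(),
    )
```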

Note: The model is designed primarily for mathematical tasks and performs best when questions are posed in English. It is not recommended for general-purpose use.