khazarai/Math-RL

Text generation · Concurrency cost: 1 · Model size: 0.5B · Quantization: BF16 · Context length: 32K · Published: Mar 24, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

khazarai/Math-RL is a 0.5 billion parameter language model, fine-tuned from Qwen2.5-0.5B-Instruct using Group Relative Policy Optimization (GRPO) on a curated dataset of 700 math problems. This model is specifically optimized to enhance step-by-step reasoning for mathematical problem-solving. It is designed for educational assistance, research into small-scale RLHF-style fine-tuning, and as a lightweight math reasoning assistant in constrained environments.
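Because the model is fine-tuned from Qwen2.5-0.5B-Instruct and inherits its chat template, a standard transformers workflow should be enough to try it out. The snippet below is a minimal inference sketch; the prompt wording and generation settings are illustrative assumptions, not values published with the model.

```python
# Minimal inference sketch (prompt and decoding settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "khazarai/Math-RL"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Solve step by step: If 3x + 7 = 22, what is x?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the step-by-step chain deterministic for inspection.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```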


Model Overview

khazarai/Math-RL is a 0.5 billion parameter model, fine-tuned from Qwen2.5-0.5B-Instruct. Its primary objective is to improve mathematical problem-solving through enhanced step-by-step reasoning. The model was optimized using Group Relative Policy Optimization (GRPO) with LoRA on a dataset of approximately 700 math problems.
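For readers who want to reproduce this style of fine-tuning, the sketch below shows one way to combine GRPO with LoRA using TRL and PEFT. It is only a sketch under assumptions: the actual training script, dataset, reward function, and hyperparameters behind Math-RL are not published, so the toy prompts and exact-match reward here are stand-ins.

```python
# Illustrative GRPO + LoRA setup with TRL and PEFT; not the repository's script.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Toy dataset: GRPO expects a "prompt" column; "answer" is an extra field of our own
# that TRL forwards to the reward function as a keyword argument.
train_dataset = Dataset.from_list([
    {"prompt": "Solve step by step: what is 12 * 7?", "answer": "84"},
    {"prompt": "Solve step by step: what is 45 + 38?", "answer": "83"},
])

def correctness_reward(completions, answer, **kwargs):
    # Stand-in reward: 1.0 when the reference answer appears in the completion.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="math-rl-grpo", num_generations=4),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```

In practice a real reward would parse the final boxed or numeric answer rather than substring-match, but the structure (prompt column, group sampling via num_generations, LoRA adapters instead of full fine-tuning) is what keeps this feasible at the 0.5B scale.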

Key Capabilities

  • Mathematical Reasoning: Specialized in generating step-by-step reasoning for math problems.
  • Small-Scale RLHF Research: Suitable for experiments with GRPO, a form of RLHF-style fine-tuning, on smaller instruction-tuned models.
  • Lightweight Deployment: Designed to function as a math reasoning assistant in environments with limited computational resources.
  • Educational Support: Can assist students with understanding and solving mathematical problems.

Intended Use Cases

  • Educational Tools: Integrating the model into platforms for math homework help or tutoring.
  • Research & Development: Exploring the effectiveness of GRPO and similar fine-tuning methods on reasoning tasks.
  • Resource-Constrained Applications: Deploying math assistance where larger models are impractical (see the loading sketch after this list).
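For the resource-constrained case, a 0.5B model in BF16 needs only about 1 GB of memory for weights, so CPU-only inference is usually workable. The snippet below is a minimal CPU loading sketch using the transformers pipeline API; the prompt is illustrative.

```python
# Minimal CPU-only sketch for constrained environments (illustrative prompt).
from transformers import pipeline

generator = pipeline("text-generation", model="khazarai/Math-RL", device="cpu")
messages = [{"role": "user", "content": "Solve step by step: what is 15% of 240?"}]
result = generator(messages, max_new_tokens=256, do_sample=False)

# For chat-style input the pipeline returns the full conversation; the last
# message holds the model's reply.
print(result[0]["generated_text"][-1]["content"])
```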

Limitations

Because the model was fine-tuned on a relatively small dataset (approximately 700 problems), its generalization to diverse math problems is limited. It may produce incorrect or hallucinated answers and should not be relied on for high-stakes calculations or critical applications. Performance is strongest on problems similar to the training data, and although the base model is multilingual, the math-specific fine-tuning was primarily English-based.