jaygala24/Qwen3-4B-GRPO-math-reasoning
Text Generation · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

jaygala24/Qwen3-4B-GRPO-math-reasoning is a 4 billion parameter Qwen3-based language model fine-tuned using Group Relative Policy Optimization (GRPO) without a KL penalty. This model is specifically optimized for mathematical reasoning tasks, leveraging datasets like GSM8K and MATH for training. It is designed to provide step-by-step reasoning for mathematical problems, making it suitable for applications requiring robust numerical problem-solving capabilities.


Model Overview

This model, jaygala24/Qwen3-4B-GRPO-math-reasoning, is a 4 billion parameter variant of the Qwen3 architecture. It has been fine-tuned for enhanced mathematical reasoning using Group Relative Policy Optimization (GRPO), notably without a KL penalty, as implemented in the PipelineRL framework.

Key Capabilities

  • Specialized Mathematical Reasoning: Optimized to solve mathematical problems, providing detailed step-by-step reasoning.
  • GRPO Fine-tuning: Uses a reinforcement learning approach (GRPO with a PPO-style policy loss and a KL coefficient of 0.0) for improved performance in its target domain.
  • Trained on Math Datasets: Leverages gsm8k_train and math_train datasets, ensuring a strong foundation in arithmetic and advanced mathematical concepts.
  • Qwen3 Base: Built upon the robust Qwen3-4B model, inheriting its general language understanding while specializing in mathematics.
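To make the training setup above concrete, here is a minimal sketch of group-relative advantage estimation and a PPO-style clipped loss with the KL coefficient set to 0.0, as described for this model. Function names, the clipping threshold, and the example rewards are illustrative assumptions, not taken from the PipelineRL implementation:

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO replaces a learned value baseline with group statistics:
    for a group of completions sampled from the same prompt,
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def ppo_clip_loss(ratio, advantage, clip_eps=0.2, kl_coef=0.0, kl=0.0):
    """PPO-style clipped surrogate per token/sequence.
    kl_coef=0.0 disables the KL penalty, matching this model's
    reported training configuration."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped) + kl_coef * kl
```

With binary correctness rewards (as produced by math answer checking), a group like `[1.0, 0.0, 1.0, 0.0]` yields advantages near `[+1, -1, +1, -1]`, so correct completions are reinforced relative to incorrect ones from the same prompt.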

Good For

  • Mathematical Problem Solving: Ideal for applications requiring accurate and reasoned solutions to math problems.
  • Educational Tools: Can be integrated into platforms for tutoring or generating explanations for mathematical concepts.
  • Research in RL for Reasoning: Provides a practical example of GRPO application in fine-tuning LLMs for specific cognitive tasks.

This model is particularly suited for scenarios where precise, verifiable mathematical outputs are critical, distinguishing it from general-purpose LLMs by its focused optimization on numerical and logical reasoning.