jaygala24/Qwen3-1.7B-GRPO-math-reasoning

TEXT GENERATIONConcurrency Cost:1Model Size:2BQuant:BF16Ctx Length:32kPublished:Apr 6, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The jaygala24/Qwen3-1.7B-GRPO-math-reasoning model is a fine-tuned version of Qwen3-1.7B, specifically optimized for mathematical reasoning tasks. It was trained using GRPO (Group Relative Policy Optimization) without a KL penalty on GSM8K and MATH datasets. This model demonstrates strong performance on math reasoning benchmarks, achieving an overall pass@32 of 95.05% across GSM8K and MATH-500 datasets. It is designed for applications requiring accurate step-by-step mathematical problem-solving.

Loading preview...

Overview

This model, jaygala24/Qwen3-1.7B-GRPO-math-reasoning, is a specialized fine-tuned variant of the Qwen3-1.7B base model. Its primary focus is on enhancing mathematical reasoning capabilities through a unique training methodology.

Key Capabilities

  • Mathematical Reasoning: Specifically optimized for solving mathematical problems, as evidenced by its training on gsm8k_train and math_train datasets.
  • GRPO Training: Utilizes Group Relative Policy Optimization (GRPO) without a KL penalty, a reinforcement learning technique, to refine its reasoning process.
  • Strong Benchmark Performance: Achieves notable results on math reasoning benchmarks:
    • GSM8K (test): 79.73% pass@1, 95.38% pass@32
    • MATH-500: 69.84% pass@1, 94.20% pass@32
    • Overall: 77.01% pass@1, 95.05% pass@32 across 1819 problems.

Training Details

The model was trained using PipelineRL with specific hyperparameters including a learning rate of 1e-06, a sequence length of 8192, and bf16 precision, leveraging DeepSpeed ZeRO Stage 3.

When to Use This Model

This model is ideal for use cases requiring robust and accurate mathematical problem-solving, particularly those involving step-by-step reasoning. Its fine-tuning on dedicated math datasets makes it a strong candidate for educational tools, automated problem solvers, or any application where precise numerical and logical deduction is critical.