jaygala24/Qwen3-4B-ReMax-math-reasoning

Text Generation · Concurrency cost: 1 · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

The jaygala24/Qwen3-4B-ReMax-math-reasoning model is a fine-tuned version of the Qwen3-4B architecture, specifically optimized for mathematical reasoning tasks. Developed by jaygala24, this model leverages the ReMax reinforcement learning algorithm without a KL penalty to enhance its problem-solving capabilities. It demonstrates strong performance on benchmarks like GSM8K and MATH-500, making it suitable for applications requiring accurate step-by-step mathematical solutions.


Overview

This model, jaygala24/Qwen3-4B-ReMax-math-reasoning, is a specialized fine-tune of the Qwen3-4B base model, developed by jaygala24. Its primary focus is on mathematical reasoning, achieved through fine-tuning with the ReMax reinforcement learning algorithm (without KL penalty) using the PipelineRL framework.

Key Capabilities & Performance

The model has been trained on mathematical datasets including gsm8k_train and math_train, and evaluated on gsm8k_test and math_500. It exhibits strong performance in mathematical problem-solving, as evidenced by its pass@k scores:

  • GSM8K (test): 89.23% pass@1, 96.13% pass@32
  • MATH-500: 81.25% pass@1, 96.60% pass@32
  • Overall: 87.04% pass@1, 96.26% pass@32

These results were obtained by generating 32 samples per problem at a sampling temperature of 1.0.
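Pass@k figures like those above are commonly computed with the standard unbiased combinatorial estimator: given n samples per problem of which c are correct, the probability that at least one of k drawn samples is correct is 1 − C(n−c, k) / C(n, k). The sketch below assumes this model card uses that estimator (the card does not state it explicitly):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem (32 for this model card)
    c: number of those samples that are correct
    k: budget of attempts being evaluated
    Returns the probability that at least one of k samples is correct.
    """
    if n - c < k:
        # Fewer incorrect samples than k: some correct sample is guaranteed.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: with 16 of 32 samples correct, pass@1 is 0.5.
print(pass_at_k(32, 16, 1))
```

The per-problem estimates are then averaged over the test set to produce benchmark-level numbers such as the 89.23% GSM8K pass@1 reported above.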

Training Details

The fine-tuning process used a learning rate of 1e-06, a sequence length of 8192, and an effective batch size of 256. The ReMax algorithm used the reward of a greedy-decoded response as the advantage baseline, performing one deterministic rollout per prompt. Full training logs are available on Weights & Biases.
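The ReMax objective described above can be sketched as a REINFORCE-style surrogate loss whose baseline is the reward of the greedy-decoded response; without a KL penalty, the advantage is simply the sampled reward minus that baseline. The function names below are illustrative, not taken from PipelineRL:

```python
def remax_advantage(sampled_reward: float, greedy_reward: float) -> float:
    """ReMax advantage: sampled-response reward minus the reward of the
    greedy-decoded response for the same prompt (the variance-reduction
    baseline). No KL term is added, matching this model's training setup."""
    return sampled_reward - greedy_reward

def remax_loss(token_logprobs: list, sampled_reward: float,
               greedy_reward: float) -> float:
    """REINFORCE-style surrogate loss: negative advantage-weighted
    log-likelihood of the sampled response (sum of its token log-probs)."""
    advantage = remax_advantage(sampled_reward, greedy_reward)
    return -advantage * sum(token_logprobs)

# Toy example: correct sampled answer (reward 1.0), wrong greedy answer
# (reward 0.0), two tokens with log-prob -0.5 each.
print(remax_loss([-0.5, -0.5], 1.0, 0.0))
```

Because the baseline is computed from a single deterministic rollout, each training prompt needs only one extra (greedy) generation beyond the sampled one, which keeps the rollout cost low compared with multi-sample baselines.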

Good for

  • Applications requiring accurate mathematical reasoning and step-by-step problem-solving.
  • Tasks involving arithmetic, algebra, and other quantitative challenges where a high pass rate on multiple attempts is beneficial.