jaygala24/Qwen3-4B-GRPO-KL-math-reasoning
Text Generation · Model Size: 4B · Quant: BF16 · Context Length: 32k · Published: Apr 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

jaygala24/Qwen3-4B-GRPO-KL-math-reasoning is a 4-billion-parameter language model fine-tuned from Qwen3-4B and optimized for mathematical reasoning. It is trained with Group Relative Policy Optimization (GRPO) with a KL penalty on the `gsm8k` and `math` datasets to strengthen step-by-step problem solving, and it supports a 32,768-token context length for long problem statements and reasoning chains.


Qwen3-4B-GRPO-KL-math-reasoning Overview

This model is a specialized 4 billion parameter language model, derived from the Qwen3-4B architecture, and fine-tuned by jaygala24. Its core distinction lies in its optimization for mathematical reasoning through the application of Group Relative Policy Optimization (GRPO) with a KL penalty.
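As a sketch of the training objective (the notation below is assumed from the general GRPO formulation, not taken from this model card): GRPO samples a group of G responses per prompt, normalizes their rewards within the group to form advantages, and maximizes a clipped surrogate with an explicit KL penalty toward the reference policy:

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
  \min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\big)
  \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right],
\qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
```

Because advantages are computed relative to the group, no separate value network is needed; the KL term (weighted by β) keeps the fine-tuned policy close to the reference model.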

Key Capabilities

  • Enhanced Mathematical Reasoning: Specifically trained to process and solve mathematical problems step-by-step.
  • GRPO Fine-tuning: Leverages the GRPO algorithm with a KL penalty for improved policy optimization during training.
  • Dataset Focus: Trained on the `gsm8k_train` and `math_train` splits, targeting common mathematical problem types.
  • High Context Length: Supports a sequence length of 32,768 tokens, allowing for complex problem descriptions and long reasoning chains.
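The group-relative normalization at the heart of GRPO can be sketched in a few lines. This is an illustrative helper, not code from the model's training repository; the function name and epsilon default are assumptions:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards to zero mean and unit std.

    In GRPO, each sampled response's reward is compared against the
    other responses for the same prompt, replacing a learned value
    baseline. `eps` guards against division by zero for uniform groups.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a group with rewards `[1.0, 0.0, 1.0, 0.0]` yields positive advantages for the two correct responses and negative ones for the two incorrect responses, summing to zero.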

When to Use This Model

  • Mathematical Problem Solving: Ideal for applications requiring accurate and reasoned solutions to arithmetic and mathematical challenges.
  • Educational Tools: Can be integrated into systems for generating explanations or verifying steps in math problems.
  • Research in RL for LLMs: Provides an example of GRPO application in fine-tuning for specific reasoning tasks.
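For verification-style use cases, `gsm8k`-formatted outputs conventionally end with a `#### <answer>` line. A minimal extraction helper might look like the following (the function name and fallback behavior are assumptions, not part of the model card):

```python
import re

def extract_final_answer(text):
    """Pull the final numeric answer from a gsm8k-style completion.

    Prefers the conventional '#### N' terminator; falls back to the
    last number mentioned in the text, or None if there is none.
    """
    m = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if m:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?\d+\.?\d*", text)
    return nums[-1] if nums else None
```

A helper like this makes it easy to score the model's step-by-step completions against reference answers when evaluating on math benchmarks.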

This model is a strong candidate for use cases where robust and verifiable mathematical reasoning is a primary requirement, offering a focused approach compared to general-purpose LLMs.