x32/Qwen3-0.6B-GRPO-GSM8K-Think

Text Generation · Model Size: 0.8B · Quant: BF16 · Ctx Length: 32k · Published: Jun 26, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

x32/Qwen3-0.6B-GRPO-GSM8K-Think is a 0.8 billion parameter language model fine-tuned from Qwen/Qwen3-0.6B. Developed by x32, this model specializes in mathematical reasoning, particularly grade school math word problems (GSM8K), trained with the GRPO method. It demonstrates improved performance on the GSM8K benchmark, making it suitable for applications requiring step-by-step mathematical problem-solving.


Overview

x32/Qwen3-0.6B-GRPO-GSM8K-Think is a specialized language model derived from the Qwen3-0.6B architecture. It has been fine-tuned using GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper, to enhance its mathematical reasoning capabilities.
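
A minimal usage sketch with the Hugging Face transformers library; the example question, dtype, and generation settings are illustrative assumptions, not documented settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "x32/Qwen3-0.6B-GRPO-GSM8K-Think"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# A sample GSM8K-style word problem (illustrative, not from the card).
question = ("Natalia sold clips to 48 of her friends in April, then sold "
            "half as many in May. How many clips did she sell in total?")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (the reasoning plus the answer).
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```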

Key Capabilities

  • Enhanced Mathematical Reasoning: Specifically optimized for solving grade school mathematical word problems (GSM8K).
  • Step-by-Step Thinking: Designed to emit a detailed reasoning trace inside `<think>` tags before giving a final answer, aiding transparency and interpretability (see the parsing sketch after this list).
  • Improved GSM8K Performance: The best checkpoint (GRPO-checkpoint-180) shows a +7.66% relative improvement on GSM8K over a local reproduction of the base Qwen3-0.6B model.
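
Because the reasoning trace arrives inside `<think>...</think>` ahead of the answer, downstream code typically splits the two. A small sketch, assuming the final answer is whatever follows the closing tag (the card does not specify the format beyond the tags):

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()            # no reasoning block emitted
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()    # everything after </think>
    return reasoning, answer

reasoning, answer = split_think("<think>48 + 24 = 72</think> Natalia sold 72 clips.")
print(answer)  # -> "Natalia sold 72 clips."
```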

Training Details

The model was trained with the TRL framework using the GRPO method, which rewards the model for generating correct reasoning. GRPO samples a group of completions per prompt and updates the policy using each completion's reward relative to the group average, so the model learns not just to arrive at the correct answer but to articulate the logical path taken.
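
The exact training recipe is not published. The following is a minimal sketch of GRPO fine-tuning on GSM8K with TRL's GRPOTrainer; the exact-match reward function and default hyperparameters are assumptions, not the author's configuration:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K rows carry "question" and "answer"; GRPOTrainer expects a "prompt"
# column and forwards the remaining columns to the reward function.
train = load_dataset("openai/gsm8k", "main", split="train")
train = train.map(lambda row: {"prompt": row["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Hypothetical reward: 1.0 when the completion contains the gold final
    # answer (the text after "####" in GSM8K references), else 0.0.
    golds = [a.split("####")[-1].strip() for a in answer]
    return [1.0 if g in c else 0.0 for c, g in zip(completions, golds)]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="Qwen3-0.6B-GRPO-GSM8K-Think"),
    train_dataset=train,
)
trainer.train()
```

In practice a shaped reward (partial credit for well-formed `<think>` traces, strict matching on the boxed answer) tends to work better than a bare exact-match signal, but the card does not say which was used here.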

Use Cases

This model is particularly well-suited for:

  • Educational tools requiring automated math problem-solving.
  • Applications needing robust mathematical reasoning on resource-constrained hardware, given the model's small footprint.
  • Research into reasoning capabilities of smaller language models.