lllyx/Qwen3-4B-Base-GRPO

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: May 3, 2026 · License: other · Architecture: Transformer

lllyx/Qwen3-4B-Base-GRPO is a 4 billion parameter language model based on Qwen3-4B-Base, fine-tuned using the GRPO reinforcement learning algorithm within the verl framework. This model is specifically optimized for mathematical reasoning and problem-solving, leveraging the DAPO-Math-17k-Processed dataset. It features a 32768-token context length and is designed to enhance performance in complex mathematical tasks through on-policy distillation.


Model Overview

lllyx/Qwen3-4B-Base-GRPO is a 4 billion parameter model derived from the Qwen3-4B-Base architecture. It has undergone a specialized reinforcement learning (RL) phase using the GRPO algorithm within the verl framework. This training specifically targets the improvement of mathematical reasoning and problem-solving capabilities.

Key Characteristics

  • Base Model: Qwen3-4B-Base
  • Training Method: Reinforcement Learning (GRPO) for on-policy distillation.
  • Primary Domain: Enhanced performance in mathematical reasoning.
  • Training Dataset: Utilizes the DAPO-Math-17k-Processed dataset, focusing on math-centric problems.
  • Context Length: Supports a substantial context window of 32768 tokens.
  • Parameter Update: Employs full-parameter actor updates during the RL phase.
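GRPO (Group Relative Policy Optimization) dispenses with a learned value function: for each prompt it samples a group of completions and scores each one relative to the others, using the group's mean and standard deviation of rewards. A minimal sketch of that normalization step (function name and epsilon are illustrative, not the verl implementation):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each completion's reward is normalized
    against the mean and std of the rewards in its own sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one math prompt, scored by a binary
# rule-based reward (1.0 = correct final answer, 0.0 = incorrect):
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantage, incorrect ones negative.
```

Because advantages are computed within each group, a prompt where every sample fails (or every sample succeeds) contributes near-zero gradient, which is one reason difficulty-filtered datasets like DAPO-Math-17k-Processed pair well with this setup.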

Training Details

The model was trained with a custom rule-based math reward function, with the learned reward model disabled. Validation was performed against the AIME24, AIME25, and AMC23 benchmarks. Training ran for a single epoch with a learning rate of 1e-6 and a PPO mini-batch size of 64.
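The card does not publish the rule-based reward itself. A plausible sketch, assuming completions are graded by exact string match on the final \boxed{...} answer (the extraction regex, function names, and 1.0/0.0 scoring are illustrative assumptions):

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a completion,
    tolerating one level of nested braces; None if no box is found."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(completion, ground_truth):
    """Rule-based reward: 1.0 for an exact-match final answer, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

reward = math_reward("... therefore the answer is \\boxed{42}.", "42")
```

A production reward would typically also normalize equivalent forms (e.g. `1/2` vs `0.5`) with a symbolic checker before comparing; exact match is the simplest correct-by-construction baseline.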

Good For

  • Applications requiring strong mathematical reasoning and problem-solving.
  • Research into on-policy distillation and reinforcement learning for language models.
  • Tasks benefiting from a large context window in a math-focused domain.