lllyx/Qwen3-4B-Base-GRPO
lllyx/Qwen3-4B-Base-GRPO is a 4-billion-parameter language model based on Qwen3-4B-Base, fine-tuned with the GRPO reinforcement learning algorithm in the verl framework. The model is optimized for mathematical reasoning and problem solving, trained on the DAPO-Math-17k-Processed dataset. It supports a 32768-token context length and is designed to improve performance on complex mathematical tasks through on-policy distillation.
Model Overview
lllyx/Qwen3-4B-Base-GRPO is a 4-billion-parameter model derived from the Qwen3-4B-Base architecture. It underwent a specialized reinforcement learning (RL) phase using the GRPO algorithm within the verl framework, targeting improved mathematical reasoning and problem-solving capabilities.
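For context, GRPO dispenses with a learned value function and instead normalizes rewards within a group of responses sampled for the same prompt. Below is a minimal sketch of the group-relative advantage computation; it is illustrative only, not the verl implementation:

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute group-relative advantages for one prompt.

    group_rewards: rewards for G responses sampled from the same prompt.
    Each response's advantage is its reward standardized against the group.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 4 sampled solutions to one math problem, scored 1.0 if correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct answers receive positive advantage
```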
Key Characteristics
- Base Model: Qwen3-4B-Base
- Training Method: Reinforcement Learning (GRPO) for on-policy distillation.
- Primary Domain: Enhanced performance in mathematical reasoning.
- Training Dataset: Utilizes the DAPO-Math-17k-Processed dataset, focusing on math-centric problems.
- Context Length: Supports a context window of 32768 tokens.
- Parameter Update: Employs full-parameter actor updates during the RL phase.
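A minimal usage sketch with Hugging Face transformers follows, assuming the model inherits standard Qwen3 causal-LM compatibility; the prompt is illustrative (this is a base-style model, so no chat template is applied):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lllyx/Qwen3-4B-Base-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Solve: If 3x + 7 = 22, what is x? Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```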
Training Details
The model was trained with a custom rule-based math reward function, with the learned reward model disabled. Validation was performed on the AIME24, AIME25, and AMC23 benchmarks. Training ran for 1 epoch with a learning rate of 1e-6 and a PPO mini-batch size of 64.
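The exact reward function is not published with the model. The sketch below shows the common rule-based pattern for math RL: a reward of 1.0 if the final answer matches the ground truth, 0.0 otherwise. The function names and the \boxed{} answer convention are assumptions for illustration, not the model's actual code:

```python
import re

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the completion, if any.
    Handles one level of nested braces, which covers typical final answers."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 for an exact string match with the reference
    answer, 0.0 otherwise. Real systems typically add numeric or symbolic
    equivalence checks on top of this."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example
print(math_reward(r"... so the answer is \boxed{5}.", "5"))  # 1.0
```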
Good For
- Applications requiring strong mathematical reasoning and problem-solving.
- Research into on-policy distillation and reinforcement learning for language models.
- Tasks benefiting from a large context window in a math-focused domain.