lllyx/Qwen3-4B-Base-GRPO

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: May 3, 2026 · License: other · Architecture: Transformer

lllyx/Qwen3-4B-Base-GRPO is a 4 billion parameter language model based on Qwen3-4B-Base, fine-tuned using the GRPO reinforcement learning algorithm within the verl framework. This model is specifically optimized for mathematical reasoning and problem-solving, leveraging the DAPO-Math-17k-Processed dataset. It features a 32768-token context length and is designed to enhance performance in complex mathematical tasks through on-policy distillation.


Model Overview

lllyx/Qwen3-4B-Base-GRPO is a 4 billion parameter model derived from the Qwen3-4B-Base architecture. It has undergone a specialized reinforcement learning (RL) phase using the GRPO algorithm within the verl framework. This training specifically targets the improvement of mathematical reasoning and problem-solving capabilities.

Key Characteristics

  • Base Model: Qwen3-4B-Base
  • Training Method: Reinforcement Learning (GRPO) for on-policy distillation.
  • Primary Domain: Enhanced performance in mathematical reasoning.
  • Training Dataset: Utilizes the DAPO-Math-17k-Processed dataset, focusing on math-centric problems.
  • Context Length: Supports a substantial context window of 32768 tokens.
  • Parameter Update: Employs full-parameter actor updates during the RL phase.
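GRPO (Group Relative Policy Optimization) dispenses with a learned value function: for each prompt it samples a group of completions and scores each one relative to the others, using the group's mean and standard deviation of rewards. A minimal sketch of that normalization step (function name and epsilon are illustrative, not the verl implementation):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each completion's reward is normalized
    against the mean and std of the rewards in its own sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for one math prompt, scored by a binary
# rule-based reward (1.0 = correct final answer, 0.0 = incorrect):
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantage, incorrect ones negative.
```

Because advantages are computed within each group, a prompt where every sample fails (or every sample succeeds) contributes near-zero gradient, which is one reason difficulty-filtered datasets like DAPO-Math-17k-Processed pair well with this setup.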

Training Details

The model was trained with a custom rule-based math reward function, with the learned reward model disabled. Validation was performed against the AIME24, AIME25, and AMC23 benchmarks. Training ran for a single epoch with a learning rate of 1e-6 and a PPO mini-batch size of 64.
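The card does not publish the rule-based reward itself. A plausible sketch, assuming completions are graded by exact string match on the final \boxed{...} answer (the extraction regex, function names, and 1.0/0.0 scoring are illustrative assumptions):

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a completion,
    tolerating one level of nested braces; None if no box is found."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(completion, ground_truth):
    """Rule-based reward: 1.0 for an exact-match final answer, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

reward = math_reward("... therefore the answer is \\boxed{42}.", "42")
```

A production reward would typically also normalize equivalent forms (e.g. `1/2` vs `0.5`) with a symbolic checker before comparing; exact match is the simplest correct-by-construction baseline.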

Good For

  • Applications requiring strong mathematical reasoning and problem-solving.
  • Research into on-policy distillation and reinforcement learning for language models.
  • Tasks benefiting from a large context window in a math-focused domain.