Lambent/Qwen3-4B-Base-Continued-GRPO is a 4-billion-parameter Qwen3-based language model developed by Lambent, with a 40,960-token context length. The model underwent experimental continued pretraining using Group Relative Policy Optimization (GRPO), with rewards calculated dynamically from semantic similarity, character n-gram F-score (chrF++), ROUGE-L, and Levenshtein distance, tailored to creative text and code generation. It shows slight improvements on tasks such as arc_easy, lambada_openai, and piqa compared to its base model.
Overview
Lambent/Qwen3-4B-Base-Continued-GRPO is a 4-billion-parameter model built on the Qwen3 architecture, with an extended context length of 40,960 tokens. It distinguishes itself through an experimental continued-pretraining approach using Group Relative Policy Optimization (GRPO): the model was rewarded for generating completions that closely resembled target data, with reward calculations adjusted dynamically for different content types such as creative text and code.
Key Capabilities & Training
The model was trained for 1034 steps using QLoRA (rank 128, alpha 256). The core innovation lies in its multi-domain reward model, which uses:
- Semantic similarity (via `all-MiniLM-L6-v2` embeddings) and chrF++ for "creative" text rewards.
- Semantic similarity, ROUGE-L, and a length penalty for "hybrid" content.
- Levenshtein distance for specific reward types.
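To make the reward scheme concrete, here is a minimal, self-contained sketch of two of the string-similarity signals listed above: a simplified chrF-style character n-gram F-score and a Levenshtein-based similarity. The function names, the reduced chrF (plain F1 averaged over n-gram orders, no word n-grams or beta weighting), and the normalization into [0, 1] are illustrative assumptions, not the model's actual implementation.

```python
# Hypothetical sketch of per-domain reward signals; names and formulas are
# simplified illustrations, not Lambent's actual reward code.
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hyp: str, ref: str, max_n: int = 4) -> float:
    """Simplified chrF: mean F1 over character n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        scores.append(0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
    return sum(scores) / len(scores) if scores else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_reward(hyp: str, ref: str) -> float:
    """Map edit distance to a similarity reward in [0, 1]."""
    if not hyp and not ref:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / max(len(hyp), len(ref))
```

In a GRPO-style setup, scores like these would be computed per sampled completion against the target text, then normalized within each group of samples to form advantages; the semantic-similarity term would additionally require an embedding model such as `all-MiniLM-L6-v2`.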
Performance Insights
Compared to its base model, Lambent/Qwen3-4B-Base-Continued-GRPO shows modest gains on several diagnostic benchmarks:
- arc_easy: +0.05% accuracy, +0.09% normalized accuracy.
- lambada_openai: +0.61% accuracy and a 4.7% reduction in perplexity.
- piqa: +0.21% accuracy.
These results suggest that the GRPO-based continued pretraining has a positive, albeit small, impact on the model's general language understanding and generation capabilities.