Lambent/Qwen3-4B-Base-Continued-GRPO is a 4-billion-parameter Qwen3-based language model developed by Lambent, with a 40,960-token context length. The model underwent experimental continued pretraining using Group Relative Policy Optimization (GRPO), with rewards calculated dynamically from semantic similarity, character n-gram F-score (chrF++), ROUGE-L, and Levenshtein distance, tailored to creative text and code generation. It shows slight improvements on tasks such as arc_easy, lambada_openai, and piqa compared to its base model.
Overview
Lambent/Qwen3-4B-Base-Continued-GRPO is a 4-billion-parameter model built on the Qwen3 architecture, with an extended context length of 40,960 tokens. It distinguishes itself through an experimental continued-pretraining approach using Group Relative Policy Optimization (GRPO): the model was rewarded for generating completions that closely resembled target data, with reward calculations adjusted dynamically for different content types such as creative text and code.
Key Capabilities & Training
The model was trained for 1034 steps using QLoRA (rank 128, alpha 256). The core innovation lies in its multi-domain reward model, which uses:
- Semantic similarity (via `all-MiniLM-L6-v2` embeddings) and chrF++ for "creative" text rewards.
- Semantic similarity, ROUGE-L, and a length penalty for "hybrid" content.
- Levenshtein distance for specific reward types.
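To make the reward scheme concrete, here is a minimal, self-contained sketch of two of the string-similarity signals listed above: a simplified chrF-style character n-gram F-score and a Levenshtein-based similarity. The function names, the reduced chrF (plain F1 averaged over n-gram orders, no word n-grams or beta weighting), and the normalization into [0, 1] are illustrative assumptions, not the model's actual implementation.

```python
# Hypothetical sketch of per-domain reward signals; names and formulas are
# simplified illustrations, not Lambent's actual reward code.
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hyp: str, ref: str, max_n: int = 4) -> float:
    """Simplified chrF: mean F1 over character n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        scores.append(0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
    return sum(scores) / len(scores) if scores else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_reward(hyp: str, ref: str) -> float:
    """Map edit distance to a similarity reward in [0, 1]."""
    if not hyp and not ref:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / max(len(hyp), len(ref))
```

In a GRPO-style setup, scores like these would be computed per sampled completion against the target text, then normalized within each group of samples to form advantages; the semantic-similarity term would additionally require an embedding model such as `all-MiniLM-L6-v2`.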
Performance Insights
Compared to its base model, Lambent/Qwen3-4B-Base-Continued-GRPO shows modest gains on several diagnostic benchmarks:
- arc_easy: +0.05% accuracy, +0.09% normalized accuracy.
- lambada_openai: +0.61% accuracy and a 4.7% reduction in perplexity.
- piqa: +0.21% accuracy.
These results suggest that the GRPO-based continued pretraining has a positive, albeit small, impact on the model's general language understanding and generation capabilities.