Kwai-Klear/GoLongRL-4B
Kwai-Klear/GoLongRL-4B is a 4 billion parameter model developed by Kwai-Klear, focusing on long-context reinforcement learning with verifiable rewards (RLVR). It utilizes a capability-oriented dataset and TMN-Reweight for optimizing heterogeneous rewards, achieving strong long-context performance. The model excels in tasks requiring precise retrieval, comprehension, and numerical reasoning over extended contexts, while preserving general capabilities.
Loading preview...
GoLongRL-4B: Long-Context Reinforcement Learning
GoLongRL-4B is a 4 billion parameter model from Kwai-Klear, specifically designed for long-context reinforcement learning with verifiable rewards (RLVR). This model introduces a novel post-training recipe that significantly enhances performance on tasks requiring extensive context understanding and processing. The framework is fully open-source, including its dataset and training code.
Key Capabilities & Innovations
- Capability-Oriented Dataset: Trained on a 23K sample dataset covering 9 distinct long-context task types, such as precise retrieval, numerical reasoning, structured extraction, and summarization. Each task incorporates natural evaluation metrics as reward functions.
- TMN-Reweight: A proposed method to address optimization challenges from heterogeneous rewards. It combines task-level mean normalization with difficulty-adaptive weighting, providing consistent improvements over vanilla GRPO.
- Strong Long-Context Performance: Achieves an average performance of 63.0 at the 4B scale, outperforming the closed-source QwenLong-L1.5 dataset even with its specialized AEPO algorithm. The model also preserves or improves general capabilities (MMLU-Pro, AIME24/25, GPQA) and shows substantial gains in dialogue memory benchmarks (LongMemEval +13.6).
Good For
- Applications requiring deep understanding and reasoning over very long texts.
- Research and development in reinforcement learning for language models.
- Tasks involving complex information retrieval, structured data extraction, and multi-document summarization.
- Developers interested in open-source long-context models and their training methodologies.