DeepScaleR-1.5B-Preview: Scaled RL for Enhanced Reasoning
DeepScaleR-1.5B-Preview, developed by Agentica, is a 1.5-billion-parameter language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B. Its core innovation is distributed reinforcement learning (RL) with an iterative context-lengthening strategy, which lets the model scale effectively to context lengths of up to 32768 tokens while maintaining high performance.
Key Capabilities
- Superior Mathematical Reasoning: Achieves 43.1% Pass@1 accuracy on AIME 2024, a 14.3-point absolute improvement over its base model (28.8%), and outperforms OpenAI's O1-Preview (40.0%) with significantly fewer parameters.
- Efficient RL Training: Employs DeepSeek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that estimates advantages by normalizing rewards within a group of sampled responses (removing the need for a separate critic model) and applies KL-divergence regularization to prevent policy drift.
- Cost-Effective Long-Context Scaling: Uses an iterative context-lengthening approach during training, starting at an 8K context and progressively extending to 16K and then 24K, which significantly reduces compute cost and training time.
- Robust Evaluation: Demonstrates strong performance across various mathematical benchmarks, including AIME 2024, MATH 500 (87.8%), AMC 2023 (73.6%), and OlympiadBench (50.0%).
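As a rough sketch of the group-relative normalization at the heart of GRPO (the function name and reward values here are illustrative, not the actual training code): each prompt gets a group of sampled responses, and each response's advantage is its reward standardized against the group's mean and standard deviation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each sampled response's
    reward against the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# For one prompt, sample a group of responses and score them
# (e.g. 1.0 for a correct final answer, 0.0 otherwise).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the normalization is relative to the group, correct responses receive positive advantages and incorrect ones negative, with no learned value model required.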
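The iterative context-lengthening idea can be pictured as a staged schedule; the step counts and field names below are invented for the sketch, not the model's actual training recipe.

```python
# Illustrative staged schedule: train in stages, each stage
# raising the maximum allowed response length.
STAGES = [
    {"max_response_tokens": 8192,  "steps": 1000},  # step counts are made up
    {"max_response_tokens": 16384, "steps": 500},
    {"max_response_tokens": 24576, "steps": 500},
]

def run_schedule(train_step, stages=STAGES):
    """Drive a training loop through progressively longer contexts."""
    for stage in stages:
        for _ in range(stage["steps"]):
            train_step(max_len=stage["max_response_tokens"])
```

Starting short keeps early rollouts cheap; later stages pay the long-context cost only once the policy already produces useful reasoning traces.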
Good For
- Complex Mathematical Problem Solving: Excels at tasks requiring advanced reasoning and accurate numerical answers, as evidenced by its AIME 2024 performance.
- Research in Scalable RL: Provides a practical example of democratizing reinforcement learning for LLMs, particularly for scaling to long contexts.
- High-Performance Inference: Compatible with popular inference systems like vLLM, Hugging Face TGI, SGLang, and TensorRT-LLM, all supporting the OpenAI Chat Completions API format.
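A minimal sketch of querying the model through an OpenAI-compatible endpoint such as a local vLLM server; the URL, port, and sampling settings are placeholders, and only the request construction is shown.

```python
import json
import urllib.request

def chat_request(problem, base_url="http://localhost:8000/v1"):
    """Build an OpenAI Chat Completions request for the model
    (base_url and sampling settings are placeholders)."""
    body = {
        "model": "agentica-org/DeepScaleR-1.5B-Preview",
        "messages": [{"role": "user", "content": problem}],
        "temperature": 0.6,   # illustrative sampling settings
        "max_tokens": 8192,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer EMPTY"},
    )

# Against a running server, the request could then be sent with:
# resp = urllib.request.urlopen(chat_request("What is 7 * 8?"))
# print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request body works unchanged against vLLM, TGI, SGLang, or TensorRT-LLM, since all expose the Chat Completions format.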
This model is released under the MIT License, promoting open and accessible AI development.