Overview
DRIVE-RL is a 32.8-billion-parameter model from Tencent's Hunyuan Team, specialized in competitive code generation. It builds on a Qwen2.5-32B base model and employs a two-stage Reinforcement Learning (RL) pipeline with verifiable rewards. This approach addresses common failure modes in code generation, such as repetitive outputs and poor performance on difficult problems.
Key Capabilities & Training
The model's training pipeline involves:
- Difficulty-Aware Supervised Fine-Tuning (SFT): The Qwen2.5-32B base is fine-tuned on a dataset in which hard competitive-programming samples are duplicated, so the model sees challenging problems more often during training.
- Two-Stage RL Process:
- Stage 1 (Entropy Expansion): Trains on a large, uniformly distributed problem set with a moderate rollout count (8 per problem) and a shorter 24k context to increase output diversity and prevent entropy collapse.
- Stage 2 (Hard-Focus Curriculum): Trains on a small, high-quality set of challenging problems using Pre-GRPO with a large rollout budget (64-80 rollouts per problem) to master difficult cases. This stage is crucial for the significant performance gains on hard problems.
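The difficulty-aware SFT step above can be sketched as a simple dataset-expansion pass. This is an illustrative sketch, not the released pipeline: the `difficulty` labels and the duplication factor of 2 are assumptions for demonstration.

```python
import random

def build_sft_dataset(samples, hard_dup_factor=2, seed=0):
    """Duplicate samples tagged as hard so SFT sees them more often.

    Illustrative sketch: the label scheme and duplication factor
    are assumptions, not values from the DRIVE-RL paper.
    """
    expanded = []
    for s in samples:
        copies = hard_dup_factor if s["difficulty"] == "hard" else 1
        expanded.extend([s] * copies)
    random.Random(seed).shuffle(expanded)  # avoid clustered duplicates
    return expanded

dataset = [
    {"id": 1, "difficulty": "easy"},
    {"id": 2, "difficulty": "hard"},
    {"id": 3, "difficulty": "hard"},
]
expanded = build_sft_dataset(dataset)
print(len(expanded))  # 5: each hard sample appears twice, the easy one once
```

The effect is a training distribution skewed toward hard problems without changing the SFT loss itself.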
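The two-stage schedule can be summarized as a pair of configurations. This is a hedged sketch: only the rollout counts (8; 64-80) and the 24k stage-1 context come from the description above, while the field names, the stage-2 context budget, and the problem-pool identifiers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class RLStageConfig:
    name: str
    num_rollouts: int  # rollouts sampled per problem per update
    max_context: int   # token budget for each rollout
    problem_set: str   # which pool of problems the stage trains on

STAGE1 = RLStageConfig(
    name="entropy_expansion",
    num_rollouts=8,         # moderate rollouts, per the description above
    max_context=24_000,     # shorter context, per the description above
    problem_set="large_uniform",
)

STAGE2 = RLStageConfig(
    name="hard_focus_curriculum",
    num_rollouts=64,        # the text reports 64-80 rollouts per problem
    max_context=32_000,     # assumption: a longer budget for hard problems
    problem_set="small_hard",
)

for stage in (STAGE1, STAGE2):
    print(stage.name, stage.num_rollouts, stage.max_context)
```

The key design choice is the contrast between the stages: a broad, cheap stage to widen the output distribution, then a narrow, expensive stage where the large rollout budget gives the verifiable reward enough samples to find correct solutions to hard problems.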
Performance & Differentiators
DRIVE-RL achieves state-of-the-art performance among similarly sized models in competitive code generation, including a +58.3% relative improvement on Codeforces OJ over its SFT baseline. The results underscore the importance of difficulty-aware training, entropy expansion, and large rollout budgets for tackling hard problems effectively, and the approach shows strong scaling trends.
Good for
- Competitive Programming: Excels at generating correct and efficient code for complex algorithmic challenges.
- Code Generation Tasks: Particularly scenarios that demand high accuracy and strong problem-solving on difficult inputs.
- Research in RL for Code: Provides a strong baseline and methodology for further exploration in reinforcement learning applied to code generation.