POLARIS-7B-Preview: Advanced Reasoning through RL Post-Training
POLARIS-7B-Preview is a 7.6-billion-parameter model developed by POLARIS-Project, distinguished by an innovative post-training method that applies reinforcement learning (RL) to significantly enhance advanced reasoning capabilities. The same recipe has been shown to elevate the performance of base models, such as Qwen3-4B, on complex reasoning tasks.
Key Capabilities & Recipe Highlights
- Reinforcement Learning Scaling: Utilizes a unique RL recipe to refine and scale reasoning abilities, pushing the boundaries of open-recipe models.
- Data Difficulty Analysis: Maps the difficulty of training data before RL begins, recommending a mirrored J-shaped difficulty distribution biased toward challenging problems for optimal training.
- Diversity-Based Rollout: Leverages diversity among rollouts to dynamically adjust sampling temperature during RL training.
- Inference-Time Length Extrapolation: Incorporates techniques for generating longer Chains of Thought (CoT) at inference than were seen during training, enabling a "train-short, generate-long" paradigm that reduces the computational cost of long-CoT training.
- Enhanced Exploration Efficiency: Achieves improved exploration through multi-stage training, allowing the model to "think longer" from the outset.
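The data-difficulty idea above can be sketched as a pass-rate filter: estimate each problem's solve rate with the current model, then resample so the retained set skews hard. This is a minimal sketch, not the released recipe; `estimate_pass_rate`, the sampling budget, and the exact retention curve are all illustrative assumptions.

```python
import random

def estimate_pass_rate(problem, model, n_samples=8):
    """Fraction of n_samples rollouts that reach the reference answer.
    (model.solve is a hypothetical interface for illustration.)"""
    correct = sum(model.solve(problem) == problem["answer"] for _ in range(n_samples))
    return correct / n_samples

def filter_by_difficulty(problems, pass_rates, keep_prob):
    """Resample so the retained set is biased toward low pass-rate (hard) items.

    keep_prob maps a pass rate in [0, 1] to a retention probability;
    a decreasing keep_prob yields a mirrored J-shaped difficulty profile
    (many hard problems, few easy ones).
    """
    kept = []
    for problem, rate in zip(problems, pass_rates):
        if random.random() < keep_prob(rate):
            kept.append(problem)
    return kept

# Example: retain nearly all unsolved problems, drop most easy ones.
problems = [{"id": i} for i in range(1000)]
rates = [random.random() for _ in problems]
hard_biased = filter_by_difficulty(problems, rates, keep_prob=lambda r: (1 - r) ** 2)
```

A quadratic retention curve is just one choice; any decreasing `keep_prob` produces the hard-biased shape.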
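One way to realize diversity-based rollout control (a hypothetical sketch; POLARIS's actual controller, diversity measure, and thresholds are not specified here) is to measure how distinct a batch of rollouts is and nudge the sampling temperature up when they collapse onto the same trajectory:

```python
def rollout_diversity(rollouts):
    """Fraction of distinct rollouts in a batch: 1.0 means all unique."""
    return len(set(rollouts)) / len(rollouts)

def adjust_temperature(temp, rollouts, target=0.8, step=0.05, lo=0.6, hi=1.5):
    """Raise temperature when rollouts are too similar, lower it when
    diversity exceeds the target; clamp to [lo, hi]. All constants here
    are illustrative, not the released recipe's values."""
    div = rollout_diversity(rollouts)
    if div < target:
        temp += step
    elif div > target:
        temp -= step
    return max(lo, min(hi, temp))

# Identical rollouts (diversity 0.125) push the temperature up;
# fully distinct rollouts (diversity 1.0) pull it back down.
warmer = adjust_temperature(1.0, ["same proof"] * 8)
cooler = adjust_temperature(1.0, [f"proof {i}" for i in range(8)])
```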
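Length extrapolation of the "train-short, generate-long" kind is commonly achieved by rescaling rotary position embeddings at inference, e.g. with YaRN. A hedged sketch of what such a configuration fragment might look like for a Qwen-style model; the scaling factor and context length below are illustrative, not POLARIS's actual settings:

```python
# Hypothetical rope_scaling entry (values illustrative): YaRN-style
# scaling lets a model trained on shorter CoTs decode longer chains.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,                              # extend usable context ~2x
    "original_max_position_embeddings": 32768,  # trained context length
}
```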
Performance & Benchmarks
POLARIS-7B-Preview demonstrates strong performance across various mathematical and reasoning benchmarks, often outperforming other 7B-class models and even some commercial systems. For instance, it achieves 72.6 on AIME24 avg@32 and 89.0 on AMC23 avg@8, showcasing its proficiency in advanced problem-solving. The model's training and evaluation codebase is built upon Verl, and its reward function is derived from DeepScaleR.
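The avg@k numbers above denote mean accuracy over k independently sampled attempts per problem. A minimal sketch of the metric (the correctness judgments are assumed to come from an external scorer):

```python
def avg_at_k(attempt_correct):
    """avg@k: mean per-problem accuracy over k sampled attempts.

    attempt_correct: one list of booleans per problem, each of length k
    (one entry per sampled solution, True if the attempt was correct).
    """
    per_problem = [sum(attempts) / len(attempts) for attempts in attempt_correct]
    return sum(per_problem) / len(per_problem)

# Two problems, 4 attempts each: accuracies 0.75 and 0.25 average to 0.5.
score = avg_at_k([[True, True, True, False], [False, True, False, False]])
```

Averaging over many samples (k=32 for AIME24, k=8 for AMC23) reduces the variance that single-sample scoring would have on small benchmarks.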