MiMo-7B-RL-0530: A Reasoning-Focused 7B LLM
MiMo-7B-RL-0530 is a 7 billion parameter language model from XiaomiMiMo, designed to unlock and enhance reasoning potential, particularly in mathematics and code. It is an improved version of MiMo-7B-RL, trained with scaled-up SFT data and an expanded RL training window, yielding consistent gains over the earlier checkpoint.
Key Innovations & Capabilities
- Reasoning-Centric Pre-Training: The base MiMo-7B model was pre-trained from scratch with an optimized data preprocessing pipeline, multi-dimensional data filtering to increase reasoning pattern density, and massive synthetic reasoning data generation. It was trained on approximately 25 trillion tokens.
- Multiple-Token Prediction (MTP): Incorporates MTP as an additional training objective to enhance performance and accelerate inference, with an acceptance rate of about 90% for speculative decoding.
- Advanced RL Post-Training: Utilizes a curated dataset of 130K mathematics and code problems for RL training, employing rule-based accuracy rewards. It introduces a test difficulty-driven code reward to mitigate sparse reward issues for challenging code problems and a data re-sampling strategy for efficient policy updates.
- Strong Performance: Demonstrates strong results on both mathematics and code reasoning tasks. The MiMo-7B-RL-0530 variant improves significantly over MiMo-7B-RL, achieving 97.2% on MATH500, 80.1% on AIME 2024, and 60.9% on LiveCodeBench v5.
- Efficient RL Infrastructure: Features a Seamless Rollout Engine for accelerated RL training and validation, achieving 2.29x faster training and 1.96x faster validation by minimizing GPU idle time.
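The test difficulty-driven code reward above can be illustrated with a short sketch. This is a minimal toy, not MiMo's actual training code: the function name and the idea of weighting each unit test by an estimated difficulty score are assumptions made for illustration. Passing only easy tests still yields a dense, partial signal rather than an all-or-nothing one.

```python
def difficulty_weighted_reward(test_results, difficulties):
    """Toy sketch of a test difficulty-driven code reward.

    test_results: list of bools, whether the generated code passed each unit test.
    difficulties: list of positive floats, estimated difficulty of each test
                  (e.g. the fraction of reference solutions that fail it).

    Harder tests contribute more reward, so partial solutions on hard
    problems still receive a non-zero, informative learning signal.
    """
    assert len(test_results) == len(difficulties)
    total = sum(difficulties)
    earned = sum(d for passed, d in zip(test_results, difficulties) if passed)
    return earned / total if total > 0 else 0.0

# Hypothetical example: the solution passes the two easy tests, fails the hard one.
partial = difficulty_weighted_reward([True, True, False], [0.2, 0.3, 0.9])
full = difficulty_weighted_reward([True, True, True], [0.2, 0.3, 0.9])
```

A fully correct solution scores 1.0, while the partial solution above earns 0.5 / 1.4 of the reward, which is exactly the kind of dense feedback that mitigates sparse rewards on challenging code problems.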
Recommended Use Cases
- Mathematical Problem Solving: Excels in complex mathematical reasoning, as evidenced by high scores on MATH500 and AIME benchmarks.
- Code Generation and Debugging: Strong performance on LiveCodeBench indicates its suitability for code-related tasks.
- General Reasoning Tasks: While specialized, its robust pre-training and RL fine-tuning contribute to strong general reasoning capabilities, as seen in GPQA-Diamond scores.
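Benchmarks like MATH500 are typically scored with a rule-based exact match on the final answer, the same style of accuracy reward used in the RL recipe described above. The sketch below assumes the common convention of extracting the last `\boxed{...}` expression from the response; the helper names are illustrative, not MiMo's actual evaluation harness.

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a model response,
    or None if absent. Handles one level of nested braces."""
    matches = re.findall(r"\\boxed\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy_reward(response, gold):
    """Rule-based accuracy reward: 1.0 on exact final-answer match, else 0.0."""
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == gold.strip() else 0.0

# Hypothetical responses, for illustration only.
hit = accuracy_reward("The roots sum to \\boxed{7}.", "7")
miss = accuracy_reward("I am not sure of the answer.", "7")
```

Real harnesses usually add answer normalization (e.g. treating `0.5` and `\frac{1}{2}` as equal) before comparison; exact string match is the simplest rule-based form.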
For more technical details, refer to the Technical Report.