XiaomiMiMo/MiMo-7B-RL-Zero
XiaomiMiMo/MiMo-7B-RL-Zero is a 7 billion parameter language model developed by the Xiaomi LLM-Core Team and optimized for reasoning tasks. It is an RL-trained variant of the MiMo-7B base model with strong performance in mathematics and code reasoning, a 32K context length, and Multi-Token Prediction (MTP) for faster inference.
MiMo-7B-RL-Zero: A Reasoning-Focused 7B Language Model
MiMo-7B-RL-Zero is a 7 billion parameter model in the XiaomiMiMo series, developed by the Xiaomi LLM-Core Team to excel at reasoning tasks. Unlike many approaches that rely on much larger base models for reasoning, MiMo-7B-RL-Zero delivers strong mathematics and code performance at 7B scale, matching larger 32B models and OpenAI o1-mini on certain benchmarks.
Key Capabilities & Innovations
- Reasoning-Centric Pre-Training: The base MiMo-7B model was pre-trained from scratch with an optimized data pipeline, focusing on increasing reasoning pattern density and generating massive synthetic reasoning data. It utilized a three-stage data mixture strategy over approximately 25 trillion tokens.
- Reinforcement Learning (RL) Optimization: MiMo-7B-RL-Zero is the result of RL training applied directly to the MiMo-7B base model. This process involved curating 130K mathematics and code problems, using rule-based accuracy rewards, and introducing a test difficulty-driven code reward to address sparse reward issues.
- Multi-Token Prediction (MTP): The model incorporates MTP as an additional training objective, improving quality and accelerating inference. With one MTP layer, it achieves an acceptance rate of about 90% in speculative decoding.
- Efficient RL Infrastructure: The team developed a Seamless Rollout Engine that speeds up RL training and validation by integrating continuous rollout, asynchronous reward computation, and early termination.
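To illustrate why a high MTP acceptance rate matters, here is a toy sketch of greedy speculative decoding, the mechanism MTP accelerates. The `draft_model` and `target_model` functions below are hypothetical stand-ins, not the real MiMo networks; in MiMo, the MTP layer plays the draft role, and every accepted draft token is a forward pass of the full model saved.

```python
def draft_model(prefix):
    # Hypothetical cheap predictor: next token is the prefix sum mod 5.
    return sum(prefix) % 5

def target_model(prefix):
    # Hypothetical full model: agrees with the draft except every 4th step.
    t = sum(prefix) % 5
    return t if len(prefix) % 4 else (t + 1) % 5

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target
    model agrees with; on the first disagreement, substitute the target's
    token and stop (greedy variant of speculative decoding)."""
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_model(proposal))
    accepted = list(prefix)
    for tok in proposal[len(prefix):]:
        if target_model(accepted) == tok:
            accepted.append(tok)  # draft accepted: full-model step saved
        else:
            accepted.append(target_model(accepted))  # correct and stop
            break
    return accepted
```

The higher the fraction of draft tokens the target accepts (about 90% with one MTP layer, per the figures above), the more full-model forward passes each step amortizes.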
Performance Highlights
In evaluations, MiMo-7B-RL-Zero improves markedly over its base model on reasoning benchmarks, achieving 93.6 Pass@1 on MATH500, 56.4 Pass@1 on AIME 2024, and 49.1 Pass@1 on LiveCodeBench v5, demonstrating strong mathematical and coding reasoning.
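For context on the metric: Pass@1 figures like these are typically computed with the standard unbiased pass@k estimator (it is an assumption that this exact estimator was used here, but it is the convention for such benchmarks). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of them correct),
    passes the check."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of correct generations, so a 93.6 Pass@1 on MATH500 means roughly 93.6% of sampled solutions are correct.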
Good for
- Complex Reasoning Tasks: Ideal for applications requiring strong mathematical problem-solving and code generation/understanding.
- Efficient Inference: MTP enables speculative decoding with a high acceptance rate, speeding up generation.
- Research and Development: Provides a strong foundation for further research into reasoning-focused LLMs, particularly in the 7B parameter class.