XiaomiMiMo/MiMo-7B-RL-Zero

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 32K · Published: Apr 29, 2025 · License: MIT · Architecture: Transformer

XiaomiMiMo/MiMo-7B-RL-Zero is a 7 billion parameter language model developed by the Xiaomi LLM-Core Team and optimized specifically for reasoning tasks. It is an RL-trained variant of the MiMo-7B base model, demonstrating strong performance in mathematics and code reasoning. It features a 32K context length and incorporates Multiple-Token Prediction (MTP) for faster inference.


MiMo-7B-RL-Zero: A Reasoning-Focused 7B Language Model

MiMo-7B-RL-Zero is a 7 billion parameter model from the XiaomiMiMo series, developed by the Xiaomi LLM-Core Team to excel at reasoning tasks. Unlike many models that rely on larger base architectures for reasoning, MiMo-7B-RL-Zero demonstrates exceptional capability in mathematics and code, matching the performance of larger 32B models and OpenAI o1-mini on certain benchmarks.

Key Capabilities & Innovations

  • Reasoning-Centric Pre-Training: The base MiMo-7B model was pre-trained from scratch with an optimized data pipeline, focusing on increasing reasoning pattern density and generating massive synthetic reasoning data. It utilized a three-stage data mixture strategy over approximately 25 trillion tokens.
  • Reinforcement Learning (RL) Optimization: MiMo-7B-RL-Zero is the result of RL training applied directly to the MiMo-7B base model. This process involved curating 130K mathematics and code problems, using rule-based accuracy rewards, and introducing a test difficulty-driven code reward to address sparse reward issues.
  • Multiple-Token Prediction (MTP): The model incorporates MTP as an additional training objective, enhancing performance and accelerating inference. With one MTP layer, it achieves an acceptance rate of about 90% for speculative decoding.
  • Efficient RL Infrastructure: XiaomiMiMo developed a Seamless Rollout Engine for faster RL training and validation, integrating continuous rollout, asynchronous reward computation, and early termination.
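The practical value of the ~90% MTP acceptance rate can be illustrated with the standard speculative-decoding speedup formula. The sketch below is a simplification that assumes each drafted token is accepted independently with the same probability; the model card itself does not specify this breakdown.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens committed per target-model forward pass in
    speculative decoding with k draft tokens and per-token acceptance
    probability p (independence assumed): (1 - p**(k+1)) / (1 - p)."""
    if p >= 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)

# One MTP layer drafts k=1 token; at the reported ~90% acceptance
# rate, each decoding step commits about 1.9 tokens on average.
print(round(expected_tokens_per_step(0.9, 1), 3))  # → 1.9
```

In other words, a single MTP layer at this acceptance rate nearly doubles the tokens produced per target-model pass, which is where the inference speedup comes from.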

Performance Highlights

Evaluations show MiMo-7B-RL-Zero significantly improves over its base model in reasoning benchmarks. For instance, it achieves 93.6 Pass@1 on MATH500, 56.4 Pass@1 on AIME 2024, and 49.1 Pass@1 on LiveCodeBench v5, showcasing its strong performance in mathematical and coding reasoning.
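Pass@1 is the probability that a single sampled completion solves a problem. When multiple completions are sampled per problem, benchmarks such as these are typically scored with the unbiased pass@k estimator from the HumanEval methodology; a minimal sketch (not MiMo-specific, and the sample counts below are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled completions for a
    problem, c of which are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: pass@k is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 8 samples drawn, 5 correct -> pass@1 = 5/8.
print(pass_at_k(8, 5, 1))  # → 0.625
```

For k=1 this reduces to the fraction of correct samples; the estimator matters more for pass@k with k > 1.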

Good for

  • Complex Reasoning Tasks: Ideal for applications requiring strong mathematical problem-solving and code generation/understanding.
  • Efficient Inference: Benefits from MTP integration, allowing for faster speculative decoding.
  • Research and Development: Provides a strong foundation for further research into reasoning-focused LLMs, particularly in the 7B parameter class.