XiaomiMiMo/MiMo-7B-RL-0530

Text Generation · Concurrency Cost: 1 · Model Size: 7B · Quant: FP8 · Ctx Length: 32k · Published: May 30, 2025 · License: MIT · Architecture: Transformer · Open Weights

XiaomiMiMo/MiMo-7B-RL-0530 is a 7 billion parameter language model developed by XiaomiMiMo, specifically engineered for enhanced reasoning capabilities in mathematics and code. This model leverages advanced pre-training strategies to increase reasoning pattern density and incorporates a sophisticated Reinforcement Learning (RL) post-training recipe. It achieves strong performance on benchmarks like MATH500 (97.2%) and AIME 2024 (80.1%), making it suitable for complex problem-solving tasks.


MiMo-7B-RL-0530: A Reasoning-Focused 7B LLM

MiMo-7B-RL-0530 is a 7 billion parameter language model from XiaomiMiMo, designed to unlock and enhance reasoning potential, particularly in mathematics and code. It is an improved version of MiMo-7B-RL, trained with scaled-up SFT data and an expanded RL training window, yielding further performance gains.

Key Innovations & Capabilities

  • Reasoning-Centric Pre-Training: The base MiMo-7B model was pre-trained from scratch with an optimized data preprocessing pipeline, multi-dimensional data filtering to increase reasoning pattern density, and massive synthetic reasoning data generation. It was trained on approximately 25 trillion tokens.
  • Multiple-Token Prediction (MTP): Incorporates MTP as an additional training objective to enhance performance and accelerate inference, with an acceptance rate of about 90% for speculative decoding.
  • Advanced RL Post-Training: Utilizes a curated dataset of 130K mathematics and code problems for RL training, employing rule-based accuracy rewards. It introduces a test difficulty-driven code reward to mitigate sparse reward issues for challenging code problems and a data re-sampling strategy for efficient policy updates.
  • Strong Performance: Demonstrates strong results on both mathematics and code reasoning tasks. The MiMo-7B-RL-0530 variant improves markedly over MiMo-7B-RL, achieving 97.2% on MATH500, 80.1% on AIME 2024, and 60.9% on LiveCodeBench v5.
  • Efficient RL Infrastructure: Features a Seamless Rollout Engine for accelerated RL training and validation, achieving 2.29x faster training and 1.96x faster validation by minimizing GPU idle time.
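The ~90% MTP acceptance rate quoted above translates directly into decoding speedup. Under standard speculative-decoding accounting (an assumption for illustration, not a detail from the report), if each drafted token is accepted independently with probability p and the draft head proposes k tokens per verification step, the expected number of tokens committed per target-model forward pass is the truncated geometric sum 1 + p + ... + p^k:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens committed per target-model forward pass when a
    draft head proposes k tokens, each accepted independently with
    probability p (truncated geometric sum 1 + p + ... + p^k)."""
    return sum(p ** i for i in range(k + 1))

# With MTP drafting one extra token (k=1) at the ~90% acceptance rate
# reported for MiMo-7B, each forward pass commits ~1.9 tokens on average,
# i.e. close to a 1.9x decoding speedup before verification overhead.
print(expected_tokens_per_pass(0.9, 1))  # ~1.9
```

In practice acceptance is not independent per position, so this is an optimistic first-order estimate rather than a measured throughput figure.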
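The idea behind a test difficulty-driven code reward can be illustrated with a toy scoring rule (a hypothetical sketch, not the report's exact formulation): instead of an all-or-nothing reward for passing every test case, tests are grouped by difficulty and each group contributes weighted partial credit, so a policy that solves only the easier tests of a hard problem still receives a learning signal:

```python
def difficulty_weighted_reward(groups):
    """groups: list of (weight, [pass/fail flags]) pairs, one pair per
    difficulty tier. Returns a dense reward in [0, 1]: the weighted
    fraction of tests passed, rather than 1 only when all tests pass."""
    total = sum(w * len(flags) for w, flags in groups)
    earned = sum(w * sum(flags) for w, flags in groups)
    return earned / total if total else 0.0

# A solution passing both easy tests but only one of two hard tests
# earns partial credit instead of the sparse 0/1 signal:
# (1*2 + 3*1) / (1*2 + 3*2) = 5/8 = 0.625
reward = difficulty_weighted_reward([(1.0, [True, True]),
                                     (3.0, [True, False])])
```

Densifying the reward this way is one standard remedy for sparse rewards in RL on hard code problems; the actual weighting scheme used for MiMo-7B is described in the Technical Report.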

Recommended Use Cases

  • Mathematical Problem Solving: Excels in complex mathematical reasoning, as evidenced by high scores on MATH500 and AIME benchmarks.
  • Code Generation and Debugging: Strong performance on LiveCodeBench indicates its suitability for code-related tasks.
  • General Reasoning Tasks: While specialized, its robust pre-training and RL fine-tuning contribute to strong general reasoning capabilities, as seen in GPQA-Diamond scores.
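Scoring the mathematical use cases above typically relies on the same kind of rule-based answer verification used for the RL accuracy rewards. A minimal, hypothetical sketch (simplified boxed-answer matching without nested braces or normalization, not the team's actual grader):

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a response.
    Simplified for illustration: no nested braces, no normalization."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_based_reward(response, gold):
    """Binary accuracy reward: 1.0 if the final boxed answer matches
    the reference exactly, else 0.0."""
    return 1.0 if extract_boxed(response) == gold else 0.0

# Exact-match check on the model's final boxed answer.
score = rule_based_reward(r"Thus the answer is \boxed{42}.", "42")
```

Real MATH-style graders also handle nested braces, LaTeX equivalence (e.g. `\frac{1}{2}` vs `0.5`), and whitespace normalization.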

For more technical details, refer to the Technical Report.