XiaomiMiMo/MiMo-7B-RL

Text generation · Concurrency cost: 1 · Model size: 7B · Quantization: FP8 · Context length: 32k · Published: Apr 29, 2025 · License: MIT · Architecture: Transformer · Open weights

XiaomiMiMo/MiMo-7B-RL is a 7 billion parameter causal language model developed by XiaomiMiMo, specifically engineered to unlock and enhance reasoning capabilities in both mathematics and code. With a 32,768-token context length, it is the reinforcement-learning-trained version of an SFT model and demonstrates strong performance on reasoning tasks, matching OpenAI o1-mini. It combines a pre-training strategy that increases reasoning pattern density with a novel post-training recipe built on rule-based accuracy rewards and test difficulty-driven code rewards. MiMo-7B-RL is optimized for complex problem-solving in STEM fields and programming.


Overview

XiaomiMiMo/MiMo-7B-RL is a 7 billion parameter language model developed by XiaomiMiMo, specifically designed to excel at reasoning tasks across mathematics and code. Despite its small footprint, MiMo-7B-RL demonstrates strong reasoning ability, even surpassing some 32B models. It achieves this through a comprehensive approach that optimizes both pre-training and post-training, focusing on enhancing the inherent reasoning capabilities of the base model.

Key Capabilities & Innovations

  • Reasoning-Focused Pre-Training: The base model, MiMo-7B-Base, was pre-trained on approximately 25 trillion tokens with an optimized data pipeline that increases reasoning pattern density, including large volumes of diverse synthetic reasoning data and a three-stage data mixture strategy. It also incorporates Multiple-Token Prediction (MTP) as an additional training objective, improving performance and accelerating inference.
  • Advanced Post-Training (RL): MiMo-7B-RL is the result of Reinforcement Learning (RL) applied to an SFT model, using a curated dataset of 130K mathematics and code problems scored with rule-based accuracy rewards. To counter sparse rewards on challenging code problems, it introduces a test difficulty-driven code reward that assigns fine-grained partial credit based on the difficulty of each test case.
  • Performance: MiMo-7B-RL shows strong performance on benchmarks like MATH500 (95.8 Pass@1), AIME 2024 (68.2 Pass@1), and LiveCodeBench v5 (57.8 Pass@1), often matching or exceeding models like OpenAI o1-mini in its category.
  • Efficient RL Infrastructure: Features a Seamless Rollout Engine for accelerated RL training and validation, integrating continuous rollout, asynchronous reward computation, and early termination to minimize GPU idle time.
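The test difficulty-driven code reward described above can be illustrated with a minimal sketch. The function, weighting scheme, and input format below are assumptions for illustration only; MiMo's actual reward formulation is defined by its training recipe, not this code.

```python
# Illustrative sketch of a difficulty-weighted code reward.
# The weights and function signature are assumptions, not the
# released MiMo implementation.

def difficulty_weighted_reward(results):
    """Score a candidate solution from per-test outcomes.

    results: list of (passed: bool, difficulty: float) tuples,
             where difficulty is a positive weight (harder test -> larger).
    Returns a score in [0, 1]: the difficulty-weighted pass rate.
    Passing hard tests earns proportionally more credit than passing
    easy ones, giving denser feedback than an all-or-nothing reward.
    """
    if not results:
        return 0.0
    total = sum(d for _, d in results)
    earned = sum(d for passed, d in results if passed)
    return earned / total

# A solution that passes only the easy tests receives partial credit
# instead of a sparse 0/1 signal:
score = difficulty_weighted_reward([(True, 0.2), (True, 0.3), (False, 1.0)])
```

The key point is that partial credit scales with test difficulty, so RL updates still receive a gradient signal on problems the model cannot yet fully solve.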

When to Use This Model

  • Complex Reasoning Tasks: Ideal for applications requiring strong mathematical and logical reasoning, such as solving competitive programming problems or advanced STEM questions.
  • Code Generation and Problem Solving: Excels in generating and debugging code, particularly for challenging problems where fine-grained feedback is crucial.
  • Efficiency-Sensitive Applications: Its 7B parameter size combined with MTP support allows for efficient inference, making it suitable for scenarios where performance and speed are critical.
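For the use cases above, a minimal inference sketch with Hugging Face transformers might look like the following. The sampling parameters are illustrative choices, not the model authors' recommended settings, and running it requires downloading the model weights.

```python
# Minimal inference sketch for MiMo-7B-RL via Hugging Face transformers.
# Sampling parameters below are illustrative, not official recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-7B-RL"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": "Prove that the sum of two odd integers is even."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=512, do_sample=True, temperature=0.6
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Reasoning models often emit long chains of thought, so budget `max_new_tokens` generously for math and code problems.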