MiMo-V2-Flash: High-Speed Agentic MoE Model
MiMo-V2-Flash, developed by XiaomiMiMo, is a Mixture-of-Experts (MoE) language model featuring 309B total parameters and 15B active parameters. It is engineered for high-speed reasoning and agentic workflows, balancing long-context modeling with inference efficiency.
Key Innovations & Capabilities
- Hybrid Attention Architecture: Combines Sliding Window Attention (SWA) and Global Attention (GA) with an aggressive 128-token window and learnable attention sink bias, reducing KV-cache storage by nearly 6x while supporting up to 256k context length.
- Multi-Token Prediction (MTP): A lightweight 0.33B-parameter module that roughly triples output speed during inference and accelerates RL training rollouts.
- Efficient Pre-Training: Trained on 27T tokens using FP8 mixed precision and a native 32k sequence length.
- Advanced Post-Training: Utilizes Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic Reinforcement Learning (RL) on massive code agent environments (100,000+ tasks) and multimodal verifiers for web development.
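The card does not spell out the exact attention formulation, so the sliding-window-plus-sink idea can only be sketched. Below is a toy NumPy illustration: the 128-token window comes from the card, while the sink treatment (one extra virtual key per head whose logit is the learnable bias) is an assumption about how such a bias is typically wired in.

```python
import numpy as np

def hybrid_attention_mask(seq_len, window=128, use_swa=True):
    """Causal attention mask for one layer.

    use_swa=True  -> sliding-window attention (SWA): each query sees only
                     the most recent `window` keys, itself included.
    use_swa=False -> global attention (GA): full causal mask.
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    return causal & (q - k < window) if use_swa else causal

def swa_with_sink(scores, mask, sink_bias):
    """Masked softmax with an 'attention sink' logit (assumed formulation).

    One extra virtual key per row carries the learnable logit `sink_bias`;
    mass assigned to it is discarded afterwards, which gives the softmax a
    stable outlet when the visible window is small.
    """
    seq_len = scores.shape[-1]
    logits = np.where(mask, scores, -np.inf)
    sink = np.full((seq_len, 1), float(sink_bias))
    full = np.concatenate([logits, sink], axis=-1)
    full -= full.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(full)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, :-1]                              # drop the sink column
```

Because an SWA layer never needs keys older than the window, its KV cache stays at 128 entries no matter how long the context grows; only the GA layers pay the full 256k cost.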
Performance Highlights
The model performs strongly across a broad range of benchmarks, often matching or surpassing models with larger active parameter counts:
- Reasoning: High scores on MMLU-Pro, GPQA-Diamond, and AIME 2025.
- Code Agent: Achieves 73.4% on SWE-Bench Verified and strong results on Terminal-Bench, indicating robust capabilities for automated code tasks.
- Long Context: Maintains high accuracy up to 256k context length, with 96.7% on NIAH-Multi at 256K.
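The ~6x KV-cache reduction claimed above follows from simple arithmetic: SWA layers cache at most 128 tokens regardless of context length, while GA layers cache the full context. A back-of-the-envelope sketch (the 48-layer count and 5:1 SWA:GA ratio are illustrative assumptions, picked to show how the reported ~6x figure can arise at 256k context):

```python
def kv_cache_tokens(context_len, n_layers, swa_layers, window=128):
    """Per-request KV-cache size in token-slots.

    SWA layers cache at most `window` tokens; the remaining global-attention
    layers cache the entire context.
    """
    ga_layers = n_layers - swa_layers
    return swa_layers * min(window, context_len) + ga_layers * context_len

# Hypothetical layer mix (the real SWA:GA ratio is not restated here).
full_cache = kv_cache_tokens(256_000, n_layers=48, swa_layers=0)
hybrid_cache = kv_cache_tokens(256_000, n_layers=48, swa_layers=40)
print(f"reduction: {full_cache / hybrid_cache:.1f}x")
```

Note that the saving grows with context length: at short contexts the window never clips anything, so the hybrid design costs nothing when it is not needed.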
Recommended Use Cases
- Agentic Workflows: Ideal for tasks requiring complex reasoning, tool use, and automated problem-solving, particularly in code generation and web development.
- High-Throughput Applications: Its MTP module and efficient architecture make it suitable for low-latency, high-throughput serving.
- Long-Context Understanding: Excellent for processing and generating content over very long documents or conversations.
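As a rough intuition for how an MTP-style module speeds up decoding, here is a toy greedy draft-and-verify loop. This is not MiMo's actual implementation: `draft_next` and `target_next` are hypothetical stand-ins for the MTP head and the full model, and verification is shown token-by-token for clarity where a real system would batch it into one forward pass.

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=3):
    """Toy greedy speculative decoding.

    The cheap draft proposes `k` tokens per round; the target verifies them
    and keeps the longest matching prefix, so the output is identical to
    target-only greedy decoding but needs fewer target rounds when the
    draft agrees often.
    """
    seq = list(prompt)
    target_rounds = 0
    while len(seq) < len(prompt) + n_tokens:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposals (one round).
        target_rounds += 1
        ctx = list(seq)
        for t in proposal:
            if target_next(ctx) != t:
                break                      # first disagreement: stop here
            seq.append(t)
            ctx.append(t)
        else:
            continue                       # all k accepted
        seq.append(target_next(ctx))       # target's own token on mismatch
    return seq[len(prompt):][:n_tokens], target_rounds
```

With a perfectly agreeing draft, each round yields `k` tokens for one verification round, which is where a ~3x decode speedup for a small module becomes plausible.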