Overview
MiMo-V2-Flash: High-Speed Agentic MoE Model
MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model from XiaomiMiMo with 309B total parameters and 15B active parameters. It is engineered for high-speed reasoning and advanced agentic workflows, and supports a 256k-token context length.
Key Innovations & Capabilities
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token window, significantly reducing KV-cache storage while preserving long-context performance through a learnable attention sink bias (see the attention sketch after this list).
- Multi-Token Prediction (MTP): Integrates a lightweight 0.33B-parameter MTP module built from dense FFNs, tripling output speed during inference and accelerating RL training rollouts (a decoding sketch follows this list).
- Efficient Pre-Training: Trained on 27T tokens using FP8 mixed precision and a native 32k sequence length, with support for contexts up to 256k tokens.
- Agentic Capabilities: Achieves strong performance on SWE-Bench and complex reasoning tasks through Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic Reinforcement Learning (RL).
- Benchmark Performance: Demonstrates strong results across general, math, code, and long-context benchmarks, often outperforming models with larger active parameter counts, particularly in agentic tasks like SWE-Bench Verified (73.4%) and τ²-Bench (80.3%).
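
To make the hybrid layout concrete, below is a minimal single-head PyTorch sketch of a 5:1 SWA/GA interleave with a learnable sink logit. The layer schedule, the sink formulation (an extra softmax column), and names such as `SinkAttention` and `layer_kind` are illustrative assumptions, not the released MiMo-V2-Flash implementation.

```python
# Minimal sketch of a 5:1 SWA/GA interleave with a learnable attention sink
# bias. Layer schedule, head layout, and the sink formulation are illustrative
# assumptions, not the released MiMo-V2-Flash code.
import torch
import torch.nn as nn
import torch.nn.functional as F

WINDOW = 128          # sliding-window size from the model card; at inference,
                      # sliding layers only cache the last WINDOW keys/values,
                      # which is where the KV-cache savings come from
SWA_TO_GA_RATIO = 5   # five sliding-window layers per global layer

def layer_kind(layer_idx: int) -> str:
    """Every (ratio+1)-th layer is global; the rest use sliding-window attention."""
    return "global" if (layer_idx + 1) % (SWA_TO_GA_RATIO + 1) == 0 else "sliding"

class SinkAttention(nn.Module):
    """Single-head causal attention with an optional sliding window and a
    learnable 'sink' logit that can absorb probability mass instead of forcing
    it onto real tokens (one way to stabilize long-context SWA)."""
    def __init__(self, dim: int, window: int | None):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.window = window
        self.sink = nn.Parameter(torch.zeros(1))  # learnable sink bias (logit)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.shape[-2]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale            # (..., T, T)
        keep = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        if self.window is not None:                                # restrict to the last WINDOW tokens
            keep &= ~torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device),
                                diagonal=-self.window)
        logits = logits.masked_fill(~keep, float("-inf"))
        # Append the sink logit as an extra "column", softmax, then drop it.
        sink = self.sink.expand(*logits.shape[:-1], 1)
        probs = F.softmax(torch.cat([logits, sink], dim=-1), dim=-1)[..., :-1]
        return self.out(probs @ v)

# A 12-layer stack under this schedule has 10 sliding and 2 global layers (5:1).
layers = [SinkAttention(1024, WINDOW if layer_kind(i) == "sliding" else None)
          for i in range(12)]
```

The MTP module can serve as a draft head for self-speculative decoding: it proposes a few tokens cheaply, and the backbone verifies them in a single forward pass. The greedy accept/verify loop below is a sketch under assumed interfaces; `main_model` and `mtp_head` are hypothetical callables that return greedy next-token predictions for every prefix of their input, not the actual MiMo-V2-Flash API.

```python
# Minimal sketch of draft-and-verify decoding with an MTP head (greedy case).
# `main_model` and `mtp_head` are hypothetical callables standing in for the
# backbone and the lightweight MTP module; real systems verify with a KV cache.
from typing import Callable, List

def mtp_generate(main_model: Callable[[List[int]], List[int]],
                 mtp_head: Callable[[List[int]], List[int]],
                 prompt: List[int],
                 max_new_tokens: int,
                 draft_len: int = 3) -> List[int]:
    tokens = list(prompt)
    target = len(prompt) + max_new_tokens
    while len(tokens) < target:
        draft = mtp_head(tokens)[:draft_len]      # cheap multi-token proposal
        preds = main_model(tokens + draft)        # one backbone verification pass
        n_ctx = len(tokens)
        accepted = 0
        # preds[n_ctx - 1 + i] is the backbone's greedy choice for the position
        # draft[i] occupies; accept drafted tokens while they agree.
        while accepted < len(draft) and draft[accepted] == preds[n_ctx - 1 + accepted]:
            accepted += 1
        # Keep the accepted prefix plus one token the backbone produced itself,
        # so each iteration emits between 1 and draft_len + 1 tokens.
        tokens += draft[:accepted] + [preds[n_ctx - 1 + accepted]]
    return tokens[:target]
```

Because the backbone still scores every position, this loop reproduces plain greedy decoding exactly; the speedup comes from emitting up to draft_len + 1 tokens per backbone pass instead of one.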
Ideal Use Cases
MiMo-V2-Flash is particularly well-suited for applications requiring:
- High-speed inference and cost-efficient deployment, thanks to its MoE architecture and MTP (a loading sketch follows this list).
- Complex reasoning and problem-solving in domains like mathematics and general knowledge.
- Advanced agentic workflows, including code generation, debugging (SWE-Bench), and general agent tasks.
- Long-context understanding and processing, leveraging its 256k context window for detailed analysis.
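
As a starting point for local experimentation, a standard Hugging Face Transformers loading pattern is sketched below. The repository id `XiaomiMiMo/MiMo-V2-Flash`, the `trust_remote_code` flag, and the dtype/device settings are assumptions; refer to the official model card for the exact snippet and the recommended serving stack.

```python
# Minimal inference sketch with Hugging Face Transformers. The repository id
# and dtype/device settings are assumptions; consult the official model card
# for the recommended serving setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-V2-Flash"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # let the checkpoint decide (e.g. bfloat16)
    device_map="auto",        # shard the MoE weights across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user",
             "content": "Summarize the repository layout of a large Python project."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```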