XiaomiMiMo/MiMo-V2-Flash

License: MIT
Overview

MiMo-V2-Flash: High-Speed Agentic MoE Model

MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model from XiaomiMiMo with 309B total parameters, of which 15B are active per token. It is engineered for high-speed reasoning and advanced agentic workflows and supports a 256k-token context length.

Key Innovations & Capabilities

  • Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token window, significantly reducing KV-cache storage while maintaining long-context performance via a learnable attention-sink bias (see the layer-pattern sketch after this list).
  • Multi-Token Prediction (MTP): Integrates a lightweight 0.33B-parameter MTP module built on dense FFNs, tripling decoding speed at inference time and accelerating RL training rollouts (see the draft-and-verify sketch below).
  • Efficient Pre-Training: Trained on 27T tokens using FP8 mixed precision and a native 32k sequence length, with support for up to 256k context.
  • Agentic Capabilities: Achieves superior performance on SWE-Bench and complex reasoning tasks through Multi-Teacher On-Policy Distillation (MOPD) and large-scale agentic Reinforcement Learning (RL).
  • Benchmark Performance: Demonstrates strong results across general, math, code, and long-context benchmarks, often outperforming models with larger active parameter counts, particularly in agentic tasks like SWE-Bench Verified (73.4%) and τ²-Bench (80.3%).
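
The following is a minimal sketch of the hybrid attention layout, assuming only the 5:1 SWA-to-GA ratio and 128-token window stated above; the layer count and function names are illustrative, not the released implementation.

```python
# Illustrative only: reconstructs a 5:1 sliding-window / global interleave and a
# sliding-window causal mask from the figures quoted in this card.
import torch

NUM_LAYERS = 48   # placeholder depth; the real config may differ
SWA_PER_GA = 5    # five sliding-window layers per global-attention layer
WINDOW = 128      # sliding-window size in tokens

def layer_attention_types(num_layers: int = NUM_LAYERS) -> list[str]:
    """Return 'swa' or 'global' per layer, interleaved at the 5:1 ratio."""
    return ["global" if (i + 1) % (SWA_PER_GA + 1) == 0 else "swa"
            for i in range(num_layers)]

def sliding_window_mask(seq_len: int, window: int = WINDOW) -> torch.Tensor:
    """Boolean mask where each token attends only to the most recent `window` positions (itself included)."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                # standard causal constraint
    in_window = (idx[:, None] - idx[None, :]) < window   # limit lookback to the window
    return causal & in_window
```

Under this layout, only every sixth layer keeps a full-length KV cache; the SWA layers' per-layer cache is bounded by the 128-token window regardless of context length, which is where the claimed KV-cache savings come from.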
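
The MTP speedup corresponds to a draft-and-verify (self-speculative) decoding loop. The sketch below shows the greedy accept rule for such a loop; the helper names, and the assumption that the MTP head serves as the drafter, are illustrative rather than the released inference code.

```python
# Hypothetical draft-and-verify loop: the MTP head proposes a few tokens cheaply and
# the main model verifies them in one forward pass, keeping the longest agreeing prefix.
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_fn: Callable[[List[int]], List[int]],   # assumed MTP drafter: proposes k tokens
    verify_fn: Callable[[List[int]], List[int]],  # main model: greedy prediction after every position
) -> List[int]:
    drafted = draft_fn(prefix)
    # verify_fn scores prefix + drafted in parallel; targets[j] is the main model's
    # greedy choice for the token that follows position j.
    targets = verify_fn(prefix + drafted)
    accepted: List[int] = []
    for i, tok in enumerate(drafted):
        if targets[len(prefix) - 1 + i] == tok:
            accepted.append(tok)
        else:
            break
    # Always gain at least one token: take the main model's own prediction at the
    # first disagreement (or after all accepted drafts).
    accepted.append(targets[len(prefix) - 1 + len(accepted)])
    return prefix + accepted
```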

Ideal Use Cases

MiMo-V2-Flash is particularly well-suited for applications requiring:

  • High-speed inference and cost-efficient deployment due to its MoE architecture and MTP.
  • Complex reasoning and problem-solving in domains like mathematics and general knowledge.
  • Advanced agentic workflows, including code generation, debugging (SWE-Bench), and general agent tasks.
  • Long-context understanding and processing, leveraging its 256k-token context window.
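
Below is a minimal quickstart sketch, assuming the checkpoint loads through Hugging Face Transformers with custom modeling code (trust_remote_code=True); consult the official repository for the supported serving stacks and recommended sampling settings.

```python
# Quickstart sketch (assumed Transformers path; dtype/quantization guidance may differ
# in the official instructions).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-V2-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",   # generic default; follow the repo's FP8/BF16 guidance
    device_map="auto",    # shard across available GPUs
)

messages = [{"role": "user", "content": "Summarize the main ideas of sliding window attention."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```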