XiaomiMiMo/MiMo-V2.5

VISION | Concurrency Cost: 4 | Model Size: 311B | Quant: FP8 | Ctx Length: 32k | Published: Apr 27, 2026 | License: mit | Architecture: Transformer | 0.3K | Open Weights | Warm

MiMo-V2.5 by XiaomiMiMo is a native omnimodal model with a sparse MoE architecture (310B total / 15B activated parameters) and a 1M-token context length. It supports text, image, video, and audio understanding through dedicated encoders and a hybrid attention backbone. The model targets multimodal perception, long-context reasoning, and agentic workflows, which suits it to applications that combine several input modalities.


MiMo-V2.5: Omnimodal Agentic Model

MiMo-V2.5, developed by XiaomiMiMo, is an omnimodal model built on a sparse Mixture of Experts (MoE) architecture: of its 310 billion total parameters, 15 billion are activated. It supports a context length of up to 1 million tokens, enabling understanding and reasoning across text, image, video, and audio inputs.
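To make the "310B total / 15B activated" figure concrete, the following is a minimal sketch of how a sparse MoE layer typically selects a few experts per token, together with the activated-parameter fraction implied by the numbers above. The router logits, the number of experts, and the value of k are hypothetical; the source does not describe MiMo-V2.5's routing, so this illustrates generic top-k routing rather than the model's actual mechanism.

```python
import math

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters that take part in one forward step."""
    return active_params / total_params

def top_k_route(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Generic top-k MoE routing: pick the k highest-scoring experts and
    renormalise their scores with a softmax over the selected experts only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]

# Figures taken from the model description.
frac = active_fraction(310e9, 15e9)
print(f"activated share of parameters: {frac:.1%}")

# Hypothetical router output for one token over four experts, with k = 2.
routing = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
for expert, weight in routing:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for a given token, which is why the activated parameter count can be far smaller than the total.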

Key Capabilities

  • Native Omnimodal Understanding: Processes and integrates text, image, video, and audio inputs within a unified architecture.
  • Hybrid Attention Architecture: Combines Sliding Window Attention (SWA) with Global Attention (GA) to reduce KV-cache storage while maintaining long-context performance.
  • Dedicated Encoders: Incorporates a 729M-parameter Vision Transformer (ViT) and a 261M-parameter audio encoder for high-quality multimodal perception.
  • Agentic Workflows: Post-trained with supervised fine-tuning (SFT), large-scale agentic reinforcement learning (RL), and Multi-Teacher On-Policy Distillation (MOPD) to strengthen agentic capabilities.
  • Efficient Inference: Features Multi-Token Prediction (MTP) modules to accelerate inference through speculative decoding.
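
The KV-cache saving attributed to the hybrid attention design above can be illustrated with a rough count of cached entries. This is a minimal sketch, not MiMo-V2.5's actual configuration: the layer count, the number of global-attention layers, and the window size are hypothetical, and it assumes the hybrid is arranged per layer (some layers use SWA, others GA), which the source does not state. Only the 1M-token sequence length comes from the description.

```python
def kv_entries(seq_len: int, n_global: int, n_sliding: int, window: int) -> int:
    """Count cached (layer, token) key/value entries for one sequence."""
    # Global-attention layers keep keys/values for every token seen so far.
    global_entries = n_global * seq_len
    # Sliding-window layers keep only the most recent `window` tokens.
    sliding_entries = n_sliding * min(seq_len, window)
    return global_entries + sliding_entries

SEQ_LEN = 1_000_000   # 1M-token context from the model description
N_LAYERS = 48         # hypothetical total number of attention layers
N_GLOBAL = 8          # hypothetical number of global-attention layers
WINDOW = 4096         # hypothetical sliding-window size in tokens

full = kv_entries(SEQ_LEN, N_LAYERS, 0, WINDOW)
hybrid = kv_entries(SEQ_LEN, N_GLOBAL, N_LAYERS - N_GLOBAL, WINDOW)

print(f"all-global: {full:,} entries")
print(f"hybrid:     {hybrid:,} entries ({hybrid / full:.1%} of all-global)")
```

Under these assumed values, the sliding-window layers contribute a fixed number of entries regardless of sequence length, so the cache is dominated by the global-attention layers as the context grows.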

Good For

  • Applications requiring multimodal perception across text, image, video, and audio.
  • Tasks demanding long-context reasoning and understanding.
  • Building agentic systems that carry out complex workflows.
  • Scenarios where efficient processing of large multimodal inputs is crucial.