XiaomiMiMo/MiMo-V2.5
MiMo-V2.5 by XiaomiMiMo is a native omnimodal model with a sparse MoE architecture (310B total / 15B activated parameters) and a 1M-token context length. It supports text, image, video, and audio understanding through dedicated encoders and a hybrid attention backbone. The model targets multimodal perception, long-context reasoning, and agentic workflows, which makes it suitable for applications that combine several input modalities.
MiMo-V2.5: Omnimodal Agentic Model
MiMo-V2.5, developed by XiaomiMiMo, is an omnimodal model built on a sparse Mixture-of-Experts (MoE) architecture with 310 billion total parameters, of which 15 billion are activated. It supports a context length of up to 1 million tokens, which allows it to reason over long text, image, video, and audio inputs.
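To make the "total versus activated" distinction concrete, the sketch below computes the activated share from the two figures quoted above and shows a generic top-k expert router. The router is a textbook illustration of sparse MoE routing, not MiMo-V2.5's actual gating code; the function names, the number of experts, and the value of `k` are all assumptions chosen for the example.

```python
import math

TOTAL_PARAMS = 310e9   # total parameters, as stated in the model card
ACTIVE_PARAMS = 15e9   # activated parameters, as stated in the model card


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that are activated."""
    return active / total


def top_k_route(gate_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and normalise their weights.

    Generic top-k gating (hypothetical helper): only the selected experts
    run, which is why far fewer parameters are active than exist in total.
    """
    # Indices of the k largest logits (ties keep their original order).
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


if __name__ == "__main__":
    print(f"Activated share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    # Four hypothetical experts, two of which are selected.
    print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
```

With the card's figures, the activated share works out to roughly 4.8% of the total parameter count.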
Key Capabilities
- Native Omnimodal Understanding: Processes and integrates text, image, video, and audio inputs within a unified architecture.
- Hybrid Attention Architecture: Combines Sliding Window Attention (SWA) with Global Attention (GA) to reduce KV-cache storage while maintaining long-context performance.
- Dedicated Encoders: Incorporates a 729M-parameter Vision Transformer (ViT) and a 261M-parameter audio encoder for high-quality multimodal perception.
- Agentic Workflows: Post-trained with supervised fine-tuning (SFT), large-scale agentic reinforcement learning (RL), and Multi-Teacher On-Policy Distillation (MOPD) to strengthen agentic capabilities.
- Efficient Inference: Features Multi-Token Prediction (MTP) modules to accelerate inference through speculative decoding.
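The speculative decoding mentioned in the last item can be sketched in its simplest greedy form: a cheap drafter proposes several tokens, and the main model keeps the longest prefix it agrees with. This is a generic illustration rather than MiMo-V2.5's MTP implementation; `toy_target`, `toy_draft`, and every other name below are hypothetical stand-ins, and a real system would verify all drafted positions in one batched forward pass instead of a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, num_draft: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The drafter proposes num_draft tokens autoregressively.
    proposed: List[int] = []
    for _ in range(num_draft):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             num_draft: int, max_new: int) -> List[int]:
    """Decode max_new tokens; the result matches greedy target decoding."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, num_draft))
    return out[: len(prefix) + max_new]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, num_draft=3, max_new=8))
```

The key property is that the output is identical to what the target model would produce on its own; the drafter only changes how many target calls are needed per generated token.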
Good For
- Applications requiring multimodal perception across text, image, video, and audio.
- Tasks demanding long-context reasoning and understanding.
- Building agentic systems that carry out complex, interactive workflows.
- Scenarios where efficient processing of large multimodal inputs is crucial.
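For a rough sense of why the hybrid SWA/GA design matters with large inputs, the sketch below estimates KV-cache size when only some layers cache the full sequence. The formula is the standard one (keys plus values, per layer, per cached token); the layer counts, window size, head count, and head dimension are placeholder assumptions, not MiMo-V2.5's published configuration.

```python
from dataclasses import dataclass


@dataclass
class HybridConfig:
    """Hypothetical attention layout; none of these values come from the card."""
    swa_layers: int          # layers using Sliding Window Attention
    ga_layers: int           # layers using Global Attention
    window: int              # SWA window size in tokens
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # e.g. 16-bit cache entries


def _bytes_per_token_per_layer(cfg: HybridConfig) -> int:
    # Factor 2 covers one key vector and one value vector per KV head.
    return 2 * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value


def hybrid_kv_bytes(cfg: HybridConfig, seq_len: int) -> int:
    """KV-cache size when SWA layers keep only the last `window` tokens."""
    swa_tokens = min(seq_len, cfg.window)
    cached = cfg.swa_layers * swa_tokens + cfg.ga_layers * seq_len
    return _bytes_per_token_per_layer(cfg) * cached


def full_kv_bytes(cfg: HybridConfig, seq_len: int) -> int:
    """KV-cache size if every layer used global attention."""
    layers = cfg.swa_layers + cfg.ga_layers
    return _bytes_per_token_per_layer(cfg) * layers * seq_len


if __name__ == "__main__":
    # Placeholder layout, chosen only to show the shape of the saving.
    cfg = HybridConfig(swa_layers=36, ga_layers=12, window=4096,
                       num_kv_heads=8, head_dim=128)
    seq_len = 1_000_000
    ratio = hybrid_kv_bytes(cfg, seq_len) / full_kv_bytes(cfg, seq_len)
    print(f"Hybrid cache is {ratio:.1%} of the all-global cache")
```

Under these assumptions, the SWA layers' contribution stops growing once the sequence exceeds the window, so at long context lengths the cache size is dominated by the global-attention layers alone.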