moonshotai/Kimi-Linear-48B-A3B-Base

Text Generation | Concurrency Cost: 3 | Model Size: 48B | Quant: FP8 | Ctx Length: 32k | Published: Oct 30, 2025 | License: MIT | Architecture: Transformer | Open Weights

Kimi-Linear-48B-A3B-Base is a 48-billion-parameter language model from Moonshot AI built on a hybrid linear attention architecture. It uses Kimi Delta Attention (KDA) to improve efficiency and quality, particularly on long-context tasks of up to 1 million tokens. Compared with traditional full attention models, it reduces KV cache requirements by up to 75% and raises decoding throughput by up to 6x. It is designed for applications that need high throughput over very long contexts.

Overview

Kimi Linear is a 48-billion-parameter model from Moonshot AI built on a hybrid linear attention architecture. Its core component is Kimi Delta Attention (KDA), a refined version of Gated DeltaNet whose gating mechanism makes more efficient use of the finite-state RNN memory. With this design, Kimi Linear outperforms traditional full attention across short-context, long-context, and reinforcement learning scaling regimes.
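To make "finite-state RNN memory" concrete, here is a minimal sketch of a gated delta-rule recurrence of the kind Gated DeltaNet is built on. It is a toy illustration under stated assumptions, not Moonshot AI's KDA kernel: the function names, the per-channel `decay` gate, and the scalar `beta` write strength are all hypothetical choices made for this example.

```python
def gated_delta_step(S, k, v, decay, beta):
    """One recurrent step on a fixed-size state S (d_k rows, d_v columns).

    decay: per-key-channel forget gates in (0, 1] (assumed form of the gate).
    beta:  write strength in [0, 1].
    """
    d_k, d_v = len(k), len(v)
    # 1) Gate: decay each key channel (row) of the old memory.
    S = [[decay[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the memory currently predicts for key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: write only the prediction error back under key k.
    return [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
            for i in range(d_k)]


def read(S, q):
    """Query the memory: o = S^T q."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


# Toy usage: the state stays d_k x d_v no matter how many tokens are processed.
S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[2.0, 3.0], decay=[1.0, 1.0], beta=1.0)
S = gated_delta_step(S, k=[0.0, 1.0], v=[5.0, 7.0], decay=[1.0, 1.0], beta=1.0)
out = read(S, [1.0, 0.0])
print(out)  # [2.0, 3.0]
```

The point of the sketch is that the memory is a fixed-size matrix updated once per token, so its cost does not grow with sequence length the way a KV cache does. Setting `decay` below 1 for a channel makes that channel forget older writes faster.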

Key Capabilities

  • Extended Context Handling: Supports context lengths up to 1 million tokens, making it well suited to tasks that must keep very long inputs in context.
  • Enhanced Efficiency: Achieves up to 6x faster decoding and a correspondingly lower time per output token (TPOT) than full attention models.
  • Reduced Memory Footprint: Cuts KV cache size by up to 75%.
  • Superior Performance: Outperforms full attention on benchmarks including long-context and RL-style tasks, as demonstrated in 1.4T-token training runs.
  • Hybrid Architecture: Interleaves KDA layers with global MLA (multi-head latent attention) layers at a 3:1 ratio to balance memory efficiency with output quality.
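The 3:1 ratio and the 75% KV cache figure are consistent with simple arithmetic, sketched below. The sketch assumes that only the global MLA layers keep a per-token KV cache, that every attention layer's cache is the same size, and that the fixed-size KDA state is negligible by comparison. The layer ordering and function names are hypothetical and are not taken from the model's actual configuration.

```python
def hybrid_layer_pattern(num_layers, kda_per_mla=3):
    """Layer types with one 'MLA' layer after every `kda_per_mla` 'KDA' layers.

    The exact placement is an assumption made for illustration.
    """
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(num_layers)]


def kv_cache_fraction(pattern):
    """Fraction of layers that still keep a per-token KV cache (the MLA layers)."""
    return pattern.count("MLA") / len(pattern)


pattern = hybrid_layer_pattern(8)
saving = 1.0 - kv_cache_fraction(pattern)
print(pattern)
print(f"KV cache reduction vs. an all-attention stack: {saving:.0%}")  # 75%
```

With three KDA layers for every MLA layer, one layer in four still caches keys and values, which gives the "up to 75%" reduction under these assumptions.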

Good For

  • Applications demanding high throughput and efficient processing of very long sequences.
  • Scenarios where memory is the limiting factor, such as serving long-context requests that would otherwise need a large KV cache.
  • Tasks benefiting from extended context understanding and generation, such as document analysis or complex reasoning over large texts.