moonshotai/Kimi-Linear-48B-A3B-Instruct
Text generation · Model size: 48B · Quant: FP8 · Context length: 32k · Published: Oct 30, 2025 · License: MIT · Architecture: Transformer · Concurrency cost: 3 · Open weights

Kimi-Linear-48B-A3B-Instruct, developed by MoonshotAI, is an instruction-tuned model with 48 billion total parameters (3 billion activated per token) built on a hybrid linear attention architecture. It uses Kimi Delta Attention (KDA) to improve both quality and hardware efficiency, particularly on long-context tasks up to 1M tokens: KV cache usage drops by up to 75% and decoding throughput rises by up to 6x, making the model well suited to applications that must process very long sequences efficiently.


Overview

Kimi Linear is a 48 billion parameter model from MoonshotAI, featuring a novel hybrid linear attention architecture designed for efficiency and strong performance across short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refinement of Gated DeltaNet whose finer-grained gating makes more effective use of finite-state RNN memory. This architecture lets the model handle context lengths up to 1 million tokens while significantly improving hardware efficiency.
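The efficiency argument above comes down to how decoding memory scales: a softmax-attention layer's KV cache grows linearly with sequence length, while a linear-attention layer like KDA carries a fixed-size recurrent state. The sketch below illustrates that difference with arithmetic only; the head counts and dimensions are hypothetical placeholders, not Kimi Linear's actual configuration.

```python
# Illustrative per-layer decoding memory: growing KV cache (softmax
# attention) vs. constant recurrent state (linear attention / KDA-style).
# All dimensions are hypothetical, chosen only for the comparison.

BYTES_PER_VALUE = 2  # fp16/bf16 storage

def kv_cache_bytes(seq_len, num_kv_heads=8, head_dim=128):
    # Keys and values are cached for every past token.
    return 2 * seq_len * num_kv_heads * head_dim * BYTES_PER_VALUE

def linear_state_bytes(num_heads=8, head_dim=128):
    # One head_dim x head_dim state matrix per head,
    # independent of how many tokens have been processed.
    return num_heads * head_dim * head_dim * BYTES_PER_VALUE

for seq_len in (4_096, 131_072, 1_048_576):
    kv = kv_cache_bytes(seq_len)
    state = linear_state_bytes()
    print(f"{seq_len:>9} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"linear state {state / 2**20:6.2f} MiB")
```

At 1M tokens the per-layer KV cache is thousands of times larger than the fixed linear state, which is why replacing most attention layers with KDA pays off at long context.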

Key Capabilities

  • Kimi Delta Attention (KDA): Employs a refined linear attention mechanism with fine-grained gating for improved performance.
  • Hybrid Architecture: Integrates a 3:1 KDA-to-global MLA ratio, reducing memory footprint while maintaining or exceeding the quality of full attention models.
  • Superior Performance: Outperforms traditional full attention methods in long-context and RL-style benchmarks, achieving 51.0 on MMLU-Pro (4k context) and 84.3 on RULER (128k context).
  • High Throughput: Delivers up to 6x faster decoding and substantially reduces time per output token (TPOT), especially for long sequences.
  • Memory Efficiency: Reduces KV cache requirements by up to 75% for contexts as long as 1M tokens.
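The 75% figure follows directly from the 3:1 layer ratio: only the global MLA layers (1 in every 4) keep a per-token KV cache, while the KDA layers hold a small, sequence-length-independent state. A minimal back-of-the-envelope check, treating the KDA state as negligible at long context:

```python
# Sanity-check the "~75% KV cache reduction" claim from the 3:1
# KDA-to-MLA layer ratio. Only MLA layers keep a per-token KV cache;
# the constant-size KDA state is ignored here as negligible at 1M tokens.

def hybrid_cache_fraction(kda_per_group=3, mla_per_group=1):
    """Fraction of a full-attention model's KV cache that the hybrid
    model retains, given its repeating layer pattern."""
    group = kda_per_group + mla_per_group
    return mla_per_group / group

frac = hybrid_cache_fraction()
print(f"cache retained: {frac:.0%}, reduction: {1 - frac:.0%}")
# 3:1 ratio -> 25% retained, i.e. a 75% reduction, matching the
# headline long-context figure.
```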

Good for

  • Applications requiring efficient processing of extremely long context lengths (up to 1M tokens).
  • Tasks where high decoding throughput and reduced memory usage are critical.
  • Scenarios demanding strong performance in both short and extended contexts, including reinforcement learning applications.