moonshotai/Kimi-Linear-48B-A3B-Instruct
Text generation · Model size: 48B · Quant: FP8 · Context length: 32k · Published: Oct 30, 2025 · License: MIT · Architecture: Transformer · Concurrency cost: 3 · Open weights

Kimi-Linear-48B-A3B-Instruct, developed by MoonshotAI, is an instruction-tuned model with 48 billion total parameters (3 billion activated per token) built on a hybrid linear attention architecture. It uses Kimi Delta Attention (KDA) to improve both quality and hardware efficiency, particularly on long-context tasks up to 1M tokens: KV cache usage drops by up to 75% and decoding throughput rises by up to 6x, making the model well suited to applications that must process very long sequences efficiently.


Overview

Kimi Linear is a 48 billion parameter model from MoonshotAI, featuring a novel hybrid linear attention architecture designed for efficiency and strong performance across short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refinement of Gated DeltaNet whose finer-grained gating makes more effective use of finite-state RNN memory. This architecture lets the model handle context lengths up to 1 million tokens while significantly improving hardware efficiency.
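The efficiency argument above comes down to how decoding memory scales: a softmax-attention layer's KV cache grows linearly with sequence length, while a linear-attention layer like KDA carries a fixed-size recurrent state. The sketch below illustrates that difference with arithmetic only; the head counts and dimensions are hypothetical placeholders, not Kimi Linear's actual configuration.

```python
# Illustrative per-layer decoding memory: growing KV cache (softmax
# attention) vs. constant recurrent state (linear attention / KDA-style).
# All dimensions are hypothetical, chosen only for the comparison.

BYTES_PER_VALUE = 2  # fp16/bf16 storage

def kv_cache_bytes(seq_len, num_kv_heads=8, head_dim=128):
    # Keys and values are cached for every past token.
    return 2 * seq_len * num_kv_heads * head_dim * BYTES_PER_VALUE

def linear_state_bytes(num_heads=8, head_dim=128):
    # One head_dim x head_dim state matrix per head,
    # independent of how many tokens have been processed.
    return num_heads * head_dim * head_dim * BYTES_PER_VALUE

for seq_len in (4_096, 131_072, 1_048_576):
    kv = kv_cache_bytes(seq_len)
    state = linear_state_bytes()
    print(f"{seq_len:>9} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"linear state {state / 2**20:6.2f} MiB")
```

At 1M tokens the per-layer KV cache is thousands of times larger than the fixed linear state, which is why replacing most attention layers with KDA pays off at long context.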

Key Capabilities

  • Kimi Delta Attention (KDA): Employs a refined linear attention mechanism with fine-grained gating for improved performance.
  • Hybrid Architecture: Integrates a 3:1 KDA-to-global MLA ratio, reducing memory footprint while maintaining or exceeding the quality of full attention models.
  • Superior Performance: Outperforms traditional full attention methods in long-context and RL-style benchmarks, achieving 51.0 on MMLU-Pro (4k context) and 84.3 on RULER (128k context).
  • High Throughput: Delivers up to 6x faster decoding and substantially reduces time per output token (TPOT), especially for long sequences.
  • Memory Efficiency: Reduces KV cache requirements by up to 75% for contexts as long as 1M tokens.
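The 75% figure follows directly from the 3:1 layer ratio: only the global MLA layers (1 in every 4) keep a per-token KV cache, while the KDA layers hold a small, sequence-length-independent state. A minimal back-of-the-envelope check, treating the KDA state as negligible at long context:

```python
# Sanity-check the "~75% KV cache reduction" claim from the 3:1
# KDA-to-MLA layer ratio. Only MLA layers keep a per-token KV cache;
# the constant-size KDA state is ignored here as negligible at 1M tokens.

def hybrid_cache_fraction(kda_per_group=3, mla_per_group=1):
    """Fraction of a full-attention model's KV cache that the hybrid
    model retains, given its repeating layer pattern."""
    group = kda_per_group + mla_per_group
    return mla_per_group / group

frac = hybrid_cache_fraction()
print(f"cache retained: {frac:.0%}, reduction: {1 - frac:.0%}")
# 3:1 ratio -> 25% retained, i.e. a 75% reduction, matching the
# headline long-context figure.
```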

Good for

  • Applications requiring efficient processing of extremely long context lengths (up to 1M tokens).
  • Tasks where high decoding throughput and reduced memory usage are critical.
  • Scenarios demanding strong performance in both short and extended contexts, including reinforcement learning applications.