Kimi-Linear-REAP-35B-A3B-Instruct: Memory-Efficient MoE
Kimi-Linear-REAP-35B-A3B-Instruct is a 35-billion-parameter Sparse Mixture-of-Experts (SMoE) causal language model released by tomascooler. It is a compressed variant of Kimi-Linear-48B-A3B-Instruct, produced with REAP (Router-weighted Expert Activation Pruning), a one-shot expert pruning method. The compression reduces the memory footprint by roughly 30% while preserving near-identical performance on key tasks.
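As a rough sanity check on the memory figure, here is a back-of-envelope sketch of weight memory alone, assuming bf16 weights (2 bytes per parameter); the actual deployment precision and runtime overheads (KV cache, activations) are assumptions not stated in this card:

```python
# Weight-memory estimate under the bf16 assumption (2 bytes/param).
# Runtime memory also includes KV cache and activations, not counted here.
BYTES_PER_PARAM = 2  # bf16

def weight_gb(n_params: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM / 1e9

base, pruned = weight_gb(48e9), weight_gb(35e9)   # 96.0 GB vs 70.0 GB
saving_pct = 100 * (base - pruned) / base         # ~27% from weights alone
```

Parameter count alone accounts for about 27% of the savings; the quoted ~30% total presumably also reflects deployment-level effects.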
Key Capabilities & Features
- Near-Lossless Performance: Achieves accuracy comparable to the 48B base model on benchmarks such as HumanEval (87.2), MBPP (83.6), LiveCodeBench (30.2), AIME25 (40.0), and LongBench v2 (37.2).
- 30% Memory Reduction: Compressed from 48B to 35B parameters, lowering deployment costs and memory requirements.
- Preserved Functionality: Retains strong capabilities in code generation, mathematical reasoning, and long-context question-answering with a 1,048,576 token context length.
- Drop-in Compatibility: Works seamlessly with vanilla vLLM without requiring source modifications or custom patches.
- REAP Method: Scores experts by combining router gate values with expert activation norms, so the most salient experts are retained and router control is preserved.
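The REAP scoring idea described above can be sketched as follows. This is an illustrative NumPy toy, not the released implementation: the exact saliency formula, gating details, and pruning ratio (25% here) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, n_tokens, d = 8, 512, 64

# gate[t, e]: router gate value for expert e on token t (softmax over experts)
logits = rng.normal(size=(n_tokens, n_experts))
gate = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# expert_out[e, t, :]: hypothetical expert output activations on a calibration set
expert_out = rng.normal(size=(n_experts, n_tokens, d))

# Router-weighted activation saliency: average over tokens of
# (gate value) * (norm of the expert's output activation).
act_norm = np.linalg.norm(expert_out, axis=2)   # shape (n_experts, n_tokens)
saliency = (gate.T * act_norm).mean(axis=1)     # shape (n_experts,)

# One-shot pruning: drop the lowest-saliency 25% of experts, keep the rest.
n_keep = int(n_experts * 0.75)
kept = np.argsort(saliency)[::-1][:n_keep]
```

Because the score weights each expert's activation by how strongly the router actually routes to it, experts the router relies on survive pruning and the router's learned control is left intact.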
Ideal Use Cases
This model is particularly well suited to resource-constrained environments, local deployments, and academic research, where keeping performance high while cutting memory consumption is the priority. Its strong results on coding and reasoning tasks also make it a good fit for applications that need efficient, capable language understanding and generation.
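Given the drop-in vLLM compatibility noted above, serving could look like the following. This is a deployment sketch: the Hub repo path (`tomascooler/...`), parallelism degree, and context-length flag value are assumptions to adapt to your hardware.

```shell
# Hypothetical vLLM serving command for this model.
pip install vllm

# Repo path assumed from the card's author; adjust to the actual Hub ID.
# --max-model-len matches the 1,048,576-token context claimed above;
# lower it if GPU memory is tight.
vllm serve tomascooler/Kimi-Linear-REAP-35B-A3B-Instruct \
  --max-model-len 1048576 \
  --tensor-parallel-size 4
```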