# Outlier-40B: Exceeding Dense Teacher Performance with Ternary MoE
Outlier-40B is a significant advance in efficient large language models: the first ternary-quantized Mixture-of-Experts (MoE) model to surpass its dense teacher's benchmark score. Built on the Qwen2.5-14B-Instruct architecture, it achieves 81.60% on MMLU (5-shot), outperforming its teacher's score of roughly 79%.
## Key Capabilities & Features
- Superior Performance: Achieves higher MMLU scores than its larger, dense teacher model.
- Efficient Architecture: A 36-billion-parameter MoE model with only ~14.4B active parameters per token, yielding an inference RAM footprint of roughly 10 GB.
- Ternary Quantization: Uses 32 ternary {-1, 0, +1} experts per MoE layer with top-2 routing, enabling high accuracy within a tight memory budget.
- Lightweight Distillation: Trained with Context-Aware KL Divergence (CAKLD) distillation; only ~3% of parameters (the lm_head and layer-normalization weights) are trainable.
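The card does not specify how weights are mapped to {-1, 0, +1}. One common scheme for ternary models is absmean scaling (as in BitNet b1.58); the sketch below is an illustration of that approach, not a description of Outlier-40B's actual quantizer:

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize a weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Absmean scaling is assumed here (BitNet b1.58 style); the scheme
    actually used by Outlier-40B is not stated in the model card.
    """
    scale = float(np.mean(np.abs(w))) + 1e-8      # per-tensor scale factor
    q = np.clip(np.round(w / scale), -1, 1)       # snap each weight to {-1, 0, +1}
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the ternary codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)
q, s = ternary_quantize(w)
w_hat = dequantize(q, s)
```

A ternary weight needs only ~2 bits when packed, which is roughly consistent with the footprint above: 36B weights × 2 bits ≈ 9 GB, plus activations and embeddings, lands near the quoted ~10 GB (assuming 2-bit packing, which the card does not confirm).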
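Top-2 routing means each token is sent to the two highest-scoring experts out of 32, with the two gate weights renormalized to sum to 1. A minimal sketch of this selection step (the router's actual parameterization in Outlier-40B is not documented here):

```python
import numpy as np

def top2_route(logits: np.ndarray, k: int = 2):
    """Pick the top-k experts per token and renormalize their gate weights.

    logits: (num_tokens, num_experts) router scores.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    idx = np.argsort(logits, axis=-1)[..., -k:]            # indices of k largest scores
    top = np.take_along_axis(logits, idx, axis=-1)         # their logits
    gates = np.exp(top - top.max(axis=-1, keepdims=True))  # stable softmax over the k picks
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return idx, gates

# One token scored against 4 hypothetical experts.
logits = np.array([[2.0, 0.5, 1.0, -1.0]])
idx, gates = top2_route(logits)
```

Only the selected experts run a forward pass, which is why active parameters per token are far fewer than the full parameter count.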
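The card names CAKLD but gives no formula. CAKLD builds on standard token-level KL distillation between teacher and student distributions; the plain KL form, which is the common baseline, can be sketched as follows (the context-aware weighting that distinguishes CAKLD is not specified here and is omitted):

```python
import numpy as np

def kd_kl_loss(teacher_logits: np.ndarray, student_logits: np.ndarray, T: float = 1.0) -> float:
    """Mean token-level KL(teacher || student) at temperature T.

    Plain knowledge-distillation KL; CAKLD's context-aware weighting
    is not described in the card, so it is not reproduced here.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    lp_t = log_softmax(teacher_logits / T)
    lp_s = log_softmax(student_logits / T)
    p_t = np.exp(lp_t)
    # Sum over vocab, average over tokens; T^2 keeps gradient scale comparable.
    return float((p_t * (lp_t - lp_s)).sum(axis=-1).mean() * T * T)

teacher = np.array([[1.0, 0.0, -1.0]])
student = np.array([[0.5, 0.5, -1.0]])
loss = kd_kl_loss(teacher, student)
```

Because only the lm_head and layer-norm weights receive gradients (~3% of parameters), this objective trains a small slice of the model against frozen ternary experts.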
## Ideal Use Cases
- Resource-Constrained Environments: Its efficient memory usage makes it suitable for deployment where RAM is a limiting factor.
- High-Performance Reasoning: Excels in tasks requiring strong reasoning capabilities, as demonstrated by its MMLU performance.
- Cost-Effective AI Solutions: Offers a compelling performance-to-cost ratio due to its efficient training and inference profile.