Outlier-Ai/Outlier-40B
Outlier-Ai/Outlier-40B is a 40 billion total parameter (14B active) ternary Mixture-of-Experts (MoE) overlay built on the Qwen2.5-14B-Instruct base model. Developed by a solo founder, it features a sparse architecture with 224 experts and TQ1_0 packing, designed to optimize MMLU performance per GB of RAM. This model is intended for research, benchmarking, and derivative fine-tunes, offering a balance between accuracy and memory footprint, particularly for Apple Silicon deployments.
Outlier-40B V3.3: Ternary Mixture-of-Experts Overlay
Outlier-40B V3.3 is a 40-billion-total-parameter (~14B active) model developed by a solo founder, implemented as a ternary Mixture-of-Experts (MoE) overlay on the frozen Qwen2.5-14B-Instruct base. Each layer pairs a shared full-precision FFN with a gated ternary expert FFN, for 224 experts in total, packed in the TQ1_0 format.
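The per-layer structure described above (dense shared FFN plus a gated, ternary-weight expert path) can be sketched as a toy forward pass. This is an illustrative sketch only, not the model's actual implementation: the dimensions, expert count, top-k routing, and per-expert scales here are hypothetical stand-ins (the real model uses Qwen2.5-14B dimensions and 224 experts).

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_FF = 8, 16      # toy sizes; the real model uses Qwen2.5-14B dimensions
N_EXPERTS, TOP_K = 4, 2    # the card states 224 experts; 4 here for illustration

# Shared FFN kept at full precision (toy weights).
W_shared_in = rng.standard_normal((D_MODEL, D_FF)) * 0.1
W_shared_out = rng.standard_normal((D_FF, D_MODEL)) * 0.1

# Ternary expert FFNs: weights restricted to {-1, 0, +1},
# with a hypothetical per-expert scale factor.
W_exp_in = rng.integers(-1, 2, size=(N_EXPERTS, D_MODEL, D_FF)).astype(np.float32)
W_exp_out = rng.integers(-1, 2, size=(N_EXPERTS, D_FF, D_MODEL)).astype(np.float32)
scales = np.full(N_EXPERTS, 0.05)

# Router projecting hidden state to per-expert logits.
W_router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def ffn(x, w_in, w_out, scale=1.0):
    """Two-layer ReLU FFN with optional weight scaling (stand-in for the real activation)."""
    return np.maximum(x @ (w_in * scale), 0.0) @ (w_out * scale)

def overlay_layer(x):
    # Dense shared path, always active.
    y = ffn(x, W_shared_in, W_shared_out)
    # Gated ternary expert path: route to top-k experts,
    # softmax over their logits gives the gate weights.
    logits = x @ W_router
    top = np.argsort(logits)[-TOP_K:]
    gates = np.exp(logits[top])
    gates /= gates.sum()
    for g, e in zip(gates, top):
        y += g * ffn(x, W_exp_in[e], W_exp_out[e], scales[e])
    return y

x = rng.standard_normal(D_MODEL)
print(overlay_layer(x).shape)  # (8,)
```

Only the selected experts' ternary matrices participate in the matmuls, which is why the active parameter count (~14B) stays far below the 40B total.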
Key Characteristics & Performance
- Architecture: Ternary MoE overlay on Qwen2.5-14B-Instruct, optimizing for memory efficiency.
- Scale: 40B total parameters with ~14B active, designed to maximize MMLU performance per GB of RAM.
- Benchmarks: Achieves 77.80% on MMLU 5-shot, 84.64% on HellaSwag 10-shot, and 73.12% on ARC-Challenge 25-shot (lm-evaluation-harness v0.4.9.1).
- Memory Optimization: The overlay compresses the expert path to approximately 1.6 bits per weight, reducing the expert memory footprint, especially when paired with int4 base quantization.
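The memory claim can be checked with back-of-envelope arithmetic. The split below assumes the expert path accounts for roughly the difference between total and base parameters (40B − 14B ≈ 26B); these are rough estimates, not measured footprints.

```python
# Rough memory arithmetic for the expert path.
# Assumption: expert params ≈ total params minus base params.
TOTAL_PARAMS = 40e9
BASE_PARAMS = 14e9
expert_params = TOTAL_PARAMS - BASE_PARAMS  # ~26B

def gigabytes(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

print(f"experts @ fp16    : {gigabytes(expert_params, 16):.1f} GB")   # ~52 GB
print(f"experts @ 1.6 bit : {gigabytes(expert_params, 1.6):.1f} GB")  # ~5.2 GB
print(f"base    @ int4    : {gigabytes(BASE_PARAMS, 4):.1f} GB")      # ~7 GB
```

Under these assumptions, the ternary expert path plus an int4 base lands near 12-13 GB of weights, versus 50+ GB for the experts alone at fp16.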
- License: Apache 2.0, suitable for broad use.
Intended Use Cases
- Research and Benchmarking: Ideal for exploring sparse MoE architectures and evaluating their performance characteristics.
- Derivative Fine-tunes: Provides a strong base for further fine-tuning and experimentation.
- Apple Silicon Deployment: Suited to production use on Apple Silicon via the MLX 4-bit builds shipped by the Outlier-Ai organization.
Limitations
- Requires the frozen Qwen2.5-14B-Instruct base model for inference; it is not a standalone checkpoint.
- The shared FFN runs at full precision, requiring careful RAM planning if not combined with base quantization.
- Primarily English-tuned, inheriting multilingual behavior directly from the base model.