Outlier-Ai/Outlier-40B

Text generation · Concurrency cost: 1 · Model size: 14.8B · Quant: FP8 · Ctx length: 32k · Published: Apr 7, 2026 · License: apache-2.0 · Architecture: Transformer

Outlier-Ai/Outlier-40B is a 40 billion total parameter (14B active) ternary Mixture-of-Experts (MoE) overlay built on the Qwen2.5-14B-Instruct base model. Developed by a solo founder, it features a sparse architecture with 224 experts and TQ1_0 packing, designed to optimize MMLU performance per GB of RAM. This model is intended for research, benchmarking, and derivative fine-tunes, offering a balance between accuracy and memory footprint, particularly for Apple Silicon deployments.


Outlier-40B V3.3: Ternary Mixture-of-Experts Overlay

Outlier-40B V3.3 is a 40-billion-total-parameter (approximately 14 billion active) model developed by a solo founder, implemented as a ternary Mixture-of-Experts (MoE) overlay on the frozen Qwen2.5-14B-Instruct base. Each layer combines the shared full-precision FFN with a gated ternary expert FFN, using 224 experts and TQ1_0 packing.
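The overlay idea described above can be sketched in a few lines: a frozen full-precision FFN carries the base signal, and a router adds a gated top-k mixture of ternary experts (weights in {-1, 0, +1} with a per-expert scale) on top. This is an illustrative toy with made-up dimensions, not the model's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2  # toy sizes, not the real config

# Frozen full-precision shared FFN (stands in for the Qwen2.5 base FFN).
W_in = rng.standard_normal((d_model, d_ff)) * 0.1
W_out = rng.standard_normal((d_ff, d_model)) * 0.1

# Ternary expert weights in {-1, 0, +1}, plus one fp scale per expert.
E_in = rng.integers(-1, 2, size=(n_experts, d_model, d_ff)).astype(np.float64)
E_out = rng.integers(-1, 2, size=(n_experts, d_ff, d_model)).astype(np.float64)
scales = np.abs(rng.standard_normal(n_experts)) * 0.05

# Router producing gate logits over experts.
W_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def overlay_ffn(x):
    shared = np.maximum(x @ W_in, 0) @ W_out           # full-precision path
    logits = x @ W_gate
    idx = np.argsort(logits)[-top_k:]                  # pick top-k experts
    gates = np.exp(logits[idx]); gates /= gates.sum()  # softmax over selected
    expert = sum(
        g * scales[e] * (np.maximum(x @ E_in[e], 0) @ E_out[e])
        for g, e in zip(gates, idx)
    )
    return shared + expert  # overlay: expert path added onto the frozen base

y = overlay_ffn(rng.standard_normal(d_model))
```

Because the expert matrices only hold three values, they can be stored at well under 2 bits per weight while the (shared) full-precision path preserves base-model quality.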

Key Characteristics & Performance

  • Architecture: Ternary MoE overlay on Qwen2.5-14B-Instruct, optimizing for memory efficiency.
  • Scale: 40B total parameters with ~14B active, designed to maximize MMLU performance per GB of RAM.
  • Benchmarks: Achieves 77.80% on MMLU 5-shot, 84.64% on HellaSwag 10-shot, and 73.12% on ARC-Challenge 25-shot (lm-evaluation-harness v0.4.9.1).
  • Memory Optimization: The overlay compresses the expert path to approximately 1.6 bits per weight, reducing the expert memory footprint, especially when paired with int4 base quantization.
  • License: Apache 2.0, suitable for broad use.
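To see why the ~1.6 bits-per-weight expert path matters, a back-of-envelope estimate helps. The figures below are illustrative assumptions derived from the card's headline numbers (40B total, 14B frozen base), not official measurements:

```python
# Rough memory estimate for the overlay (assumptions, not official numbers).
total_params = 40e9     # total parameters per the model card
base_params = 14e9      # frozen Qwen2.5-14B-Instruct base (approx.)
expert_params = total_params - base_params  # ~26B overlay expert weights

GIB = 1024**3
expert_gb = expert_params * 1.6 / 8 / GIB   # experts at ~1.6 bits/weight
base_int4_gb = base_params * 4 / 8 / GIB    # base quantized to int4
base_fp16_gb = base_params * 16 / 8 / GIB   # base at fp16, for comparison

print(f"experts (ternary): {expert_gb:5.1f} GiB")
print(f"base (int4):       {base_int4_gb:5.1f} GiB")
print(f"base (fp16):       {base_fp16_gb:5.1f} GiB")
```

Under these assumptions the ~26B expert parameters compress to roughly 5 GiB, so pairing the overlay with an int4 base keeps the whole model near 12 GiB, whereas an fp16 base alone would need about 26 GiB.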

Intended Use Cases

  • Research and Benchmarking: Ideal for exploring sparse MoE architectures and evaluating their performance characteristics.
  • Derivative Fine-tunes: Provides a strong base for further fine-tuning and experimentation.
  • Apple Silicon Deployment: Optimized for production use on Apple Silicon via MLX 4-bit builds published by the Outlier-Ai organization.

Limitations

  • Requires the frozen Qwen2.5-14B-Instruct base model for inference; it is not a standalone checkpoint.
  • The shared FFN runs at full precision, requiring careful RAM planning if not combined with base quantization.
  • Primarily English-tuned, inheriting multilingual behavior directly from the base model.
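The TQ1_0 packing mentioned above rests on the fact that five ternary digits fit in one byte (3^5 = 243 ≤ 256), i.e. 8/5 = 1.6 bits per trit; the actual TQ1_0 format in llama.cpp additionally stores per-block scales and lands near 1.69 bpw. A simplified sketch of the base-3 packing idea (not the real on-disk layout):

```python
def pack_trits(trits):
    """Pack ternary values in {-1, 0, +1} (length a multiple of 5) base-3,
    five trits per byte."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in trits[i:i + 5]:
            byte = byte * 3 + (t + 1)  # map -1,0,+1 -> digits 0,1,2
        out.append(byte)               # max value 3**5 - 1 = 242 < 256
    return bytes(out)

def unpack_trits(data, n):
    """Invert pack_trits, recovering the first n ternary values."""
    trits = []
    for byte in data:
        group = []
        for _ in range(5):
            group.append(byte % 3 - 1)  # digits 0,1,2 -> -1,0,+1
            byte //= 3
        trits.extend(reversed(group))   # digits come out least-significant first
    return trits[:n]

trits = [1, 0, -1, 1, 0, -1, -1, 0, 1, 1]
packed = pack_trits(trits)
rate = len(packed) * 8 / len(trits)  # bits per stored weight
```

Round-tripping through pack/unpack recovers the original values at exactly 1.6 bits per weight before scale overhead.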