empero-ai/openNemo-Cascade-2-30B-A3B
openNemo-Cascade-2-30B-A3B by Empero AI is a 30.87-billion-parameter Mixture-of-Experts (MoE) model with about 3 billion active parameters per token and a 32,768-token context length. It is a pure-PyTorch re-implementation of NVIDIA's Nemotron-Cascade-2-30B-A3B that removes the external CUDA kernel dependencies, which makes it compatible with bitsandbytes 4-bit quantization and QLoRA fine-tuning. Because it keeps the original weights, it targets the same complex reasoning workloads: the original model reached gold-medal-level scores on IMO 2025 and IOI 2025, which makes it a candidate for advanced problem solving and mathematical reasoning.
openNemo-Cascade-2-30B-A3B: Pure PyTorch MoE for Advanced Reasoning
Empero AI's openNemo-Cascade-2-30B-A3B is a 30.87 billion parameter Mixture-of-Experts (MoE) model, with approximately 3 billion active parameters per token. It is a direct, pure-PyTorch replacement for NVIDIA's Nemotron-Cascade-2-30B-A3B, designed to eliminate dependencies on external CUDA kernels like mamba-ssm and causal-conv1d.
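Since the model is described as a pure-PyTorch replacement, loading it would presumably go through the usual Hugging Face transformers path with a bitsandbytes config. The following is a minimal sketch under stated assumptions, not the authors' documented procedure: it assumes the repository name on this page resolves on the Hugging Face Hub, that the repository ships custom modeling code (hence trust_remote_code=True), and that NF4 with bfloat16 compute is an acceptable setting. The helper name load_4bit is hypothetical.

```python
MODEL_ID = "empero-ai/openNemo-Cascade-2-30B-A3B"  # repository name from this page


def load_4bit(model_id=MODEL_ID):
    """Load tokenizer and model with 4-bit bitsandbytes weights (needs a CUDA GPU)."""
    # Imported lazily so this file can be imported without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # assumption: NF4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute dtype
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant,
        device_map="auto",
        trust_remote_code=True,  # assumption: the repo provides its own modeling code
    )
    return tokenizer, model
```

Calling load_4bit() downloads the full checkpoint, so it is best run once on the target GPU machine. Note that neither mamba-ssm nor causal-conv1d appears among the imports.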
Key Differentiators & Capabilities
- Enhanced Quantization & Fine-tuning: Replacing the CUDA kernels with native PyTorch operations makes the model compatible with bitsandbytes 4-bit quantization and QLoRA fine-tuning on consumer GPUs. When quantized, it loads in approximately 17 GB of VRAM.
- Preserved Performance: It retains the original Nemotron-Cascade-2's architecture and weights, so the original model's results are expected to carry over. The original model achieved gold-medal-level scores on challenging reasoning competitions such as IMO 2025 (35 pts) and IOI 2025 (439.3 pts).
- Flexible Architecture: The model is a 52-layer hybrid, combining Mamba2 SSM blocks, Mixture-of-Experts blocks (128 routed experts, top-6 selected), and Grouped Query Attention blocks.
- Simplified Deployment: No mamba-ssm or causal-conv1d installation is required, simplifying setup and avoiding common CUDA version conflicts.
- Memory Optimization: Includes an automatic fix for async weight loading to prevent Out-of-Memory (OOM) errors during 4-bit quantization on GPUs with less VRAM.
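The expert selection mentioned in the architecture item (128 routed experts, top-6 selected) can be illustrated with a toy sketch in plain Python. This is not the model's router: the softmax over the selected logits, the function name route_token, and the random logits are assumptions chosen only to show the general top-k idea.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts per MoE block (from the model card)
TOP_K = 6          # experts selected per token (from the model card)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_ids, weights) for one token.

    Assumption: weights are a softmax over the selected logits only;
    the real router may score and normalize differently.
    """
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
expert_ids, weights = route_token(logits)
print(expert_ids, round(sum(weights), 6))
```

Only the six selected experts would run for this token, which is the usual reason a model with 30.87 billion total parameters has a much smaller active count per token.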
Ideal Use Cases
- Advanced Reasoning & Problem Solving: Targets complex mathematical and logical reasoning tasks, as indicated by the original model's IMO 2025 and IOI 2025 results.
- Resource-Constrained Environments: Suitable for deployment and fine-tuning on consumer-grade GPUs due to its 4-bit quantization compatibility and reduced VRAM footprint.
- Research & Development: Provides a flexible, pure-PyTorch base for experimenting with MoE models, quantization, and QLoRA fine-tuning without kernel-related hurdles.
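As a rough sanity check on the reduced VRAM footprint, the weight storage can be estimated from the parameter count alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every parameter is quantized, and the second figure assumes one 32-bit scale per block of 64 weights (an extra 0.5 bits per weight). Real usage also includes unquantized layers, activations, and cache or state memory.

```python
TOTAL_PARAMS = 30.87e9  # parameter count from the model card


def weight_gb(num_params, bits_per_param):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# 4-bit codes only, ignoring quantization metadata.
codes_only = weight_gb(TOTAL_PARAMS, 4.0)
# Assumption: one 32-bit scale per 64-weight block adds 0.5 bits per weight.
with_scales = weight_gb(TOTAL_PARAMS, 4.0 + 32 / 64)
# For comparison: unquantized 16-bit weights.
bf16 = weight_gb(TOTAL_PARAMS, 16.0)

print(f"4-bit codes only : {codes_only:.1f} GB")
print(f"4-bit + scales   : {with_scales:.1f} GB")
print(f"16-bit baseline  : {bf16:.1f} GB")
```

Under these assumptions the estimates come to about 15.4 GB, 17.4 GB, and 61.7 GB, which is in line with the roughly 17 GB quoted above for the quantized model.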