baidu/ERNIE-4.5-21B-A3B-PT

Text Generation | Concurrency Cost: 1 | Model Size: 21B | Quant: FP8 | Ctx Length: 32k | Published: Jun 28, 2025 | License: apache-2.0 | Architecture: Transformer | 0.2K | Open Weights | Cold

ERNIE-4.5-21B-A3B-PT is a text-based Mixture-of-Experts (MoE) model developed by Baidu, with 21 billion total parameters, of which 3 billion are activated per token. This post-trained model supports a context length of 32,768 tokens. The ERNIE 4.5 architecture uses a heterogeneous MoE structure with modality-isolated routing for efficient multimodal pre-training, though this specific variant is text-only. It is optimized for general-purpose language understanding and generation, with post-training based on methods such as Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), or Unified Preference Optimization (UPO).


ERNIE-4.5-21B-A3B-PT Overview

ERNIE-4.5-21B-A3B-PT is a text-focused Mixture-of-Experts (MoE) model from Baidu, with 21 billion total parameters and 3 billion activated parameters per token. It is built on the ERNIE 4.5 architecture, which incorporates Multimodal Heterogeneous MoE Pre-Training (though this specific model is text-only) and a Scaling-Efficient Infrastructure for high-performance training and inference. Training relies on heterogeneous hybrid parallelism and a hierarchical load-balancing strategy, alongside FP8 mixed-precision training and fine-grained recomputation.
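
To illustrate what "activated parameters per token" means, here is a minimal sketch of generic top-k expert routing in plain Python. It is not ERNIE 4.5's actual router: the expert count, the value of k, the scalar "experts", and the names `route_top_k` and `moe_forward` are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and combine their outputs by weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Hypothetical toy setup (NOT the real model configuration):
# 8 scalar "experts", of which 2 are active for a given token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_top_k(logits, TOP_K)
output = moe_forward(1.0, experts, logits, TOP_K)
print("selected experts and weights:", selected)
print("combined output:", output)
```

Only the selected experts are evaluated for a token, which is why per-token compute tracks the activated parameter count rather than the total.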

Key Capabilities

  • Efficient MoE Architecture: Uses a 21B total parameter MoE design with 3B activated parameters, so each token is processed by roughly one seventh of the model's weights.
  • Optimized for Text: Specifically post-trained for general-purpose language understanding and generation tasks.
  • Advanced Training: Trained on infrastructure that combines hybrid parallelism, FP8 mixed precision, and fine-grained recomputation for high throughput and efficient resource utilization.
  • Long Context: Supports a context length of 32,768 tokens, enabling processing of longer inputs.
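
As a small sketch of working within the 32,768-token window, the helper below computes how many tokens remain for generation after a prompt. It assumes the prompt and the generated tokens share the same context window, which is typical for decoder-only models but is not stated on this page; the function name and the `reserve` parameter are hypothetical.

```python
CONTEXT_LENGTH = 32768  # context length stated on this model card


def max_new_tokens(prompt_tokens, context_length=CONTEXT_LENGTH, reserve=0):
    """Return how many tokens can still be generated.

    prompt_tokens: number of tokens already in the prompt.
    reserve: optional safety margin (hypothetical, e.g. for special tokens).
    Returns 0 if the prompt already fills or exceeds the window.
    """
    if prompt_tokens < 0 or reserve < 0:
        raise ValueError("prompt_tokens and reserve must be non-negative")
    return max(context_length - prompt_tokens - reserve, 0)


print(max_new_tokens(30000))        # 2768
print(max_new_tokens(30000, reserve=768))  # 2000
print(max_new_tokens(40000))        # 0 (prompt alone exceeds the window)
```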

Good For

  • General Language Tasks: Suited to a wide range of text-based applications that require robust language understanding and generation.
  • Efficient Deployment: Inference optimizations include multi-expert parallel collaboration and convolutional code quantization, which targets 4-bit/2-bit quantization reported as lossless, making the model suitable for a range of hardware platforms.
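
For a rough sense of what lower-bit quantization means for deployment, the sketch below estimates weight storage at the bit widths mentioned on this page. It is back-of-the-envelope arithmetic only: it uses the rounded 21B and 3B parameter counts, reports decimal gigabytes, and ignores the KV cache, activations, and any quantization metadata such as scales. The function name is hypothetical.

```python
TOTAL_PARAMS = 21e9   # rounded total parameter count from this page
ACTIVE_PARAMS = 3e9   # rounded activated parameter count from this page


def weight_gb(params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


footprints = {
    name: weight_gb(TOTAL_PARAMS, bits)
    for name, bits in [("FP8", 8), ("4-bit", 4), ("2-bit", 2)]
}
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.2f} GB of weights")
print(f"fraction of parameters active per token: {active_fraction:.3f}")
```

Under these assumptions the weights take about 21 GB at FP8, 10.5 GB at 4 bits, and 5.25 GB at 2 bits. In a typical MoE deployment all experts still have to be stored, so the 3B activated parameters reduce per-token compute rather than weight storage.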