Qwen3-14B-Base Overview
Qwen3-14B-Base is a 14.8-billion-parameter pre-trained causal language model from the Qwen series, built on advances in training data, architecture, and optimization. Its pre-training corpus has been substantially expanded to 36 trillion tokens across 119 languages, tripling the language coverage of its predecessor, Qwen2.5. The corpus includes a rich mix of high-quality data, such as coding, STEM, reasoning, and multilingual content.
Key Improvements & Features
- Expanded Pre-training Corpus: Trained on 36 trillion tokens across 119 languages, with a focus on high-quality data for coding, STEM, and reasoning.
- Architectural Refinements: Incorporates training techniques such as global-batch load-balancing loss for MoE models and QK layernorm for all models, improving training stability and performance (see the attention sketch after this list).
- Three-stage Pre-training: A structured approach that first builds general language modeling capability, then strengthens reasoning skills (STEM, coding, logical reasoning), and finally extends long-context comprehension to sequences of up to 32,768 tokens.
- Scaling Law Guided Tuning: Hyperparameters are systematically tuned using scaling law studies across the pre-training pipeline for optimal training dynamics and performance.
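To make the QK layernorm and grouped-query attention points above concrete, the sketch below shows one possible attention block that normalizes queries and keys per head before computing attention. This is an illustrative assumption, not Qwen3's actual implementation: the module name `QKNormAttention`, the hidden size of 5120, and the use of `nn.RMSNorm` (requires PyTorch >= 2.4) are assumptions for the sake of a runnable example; only the 40 query heads and 8 KV heads come from the Model Specifications listed below.

```python
# Illustrative sketch of QK layernorm + grouped-query attention (not Qwen3 source code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, hidden_size=5120, num_q_heads=40, num_kv_heads=8):
        super().__init__()
        self.head_dim = hidden_size // num_q_heads
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.q_proj = nn.Linear(hidden_size, num_q_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_q_heads * self.head_dim, hidden_size, bias=False)
        # QK layernorm: normalize queries and keys per head before attention.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_q_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # per-head normalization
        # Grouped-query attention: each KV head is shared by several query heads.
        k = k.repeat_interleave(self.num_q_heads // self.num_kv_heads, dim=2)
        v = v.repeat_interleave(self.num_q_heads // self.num_kv_heads, dim=2)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))

# Quick shape check with random input.
x = torch.randn(1, 16, 5120)
print(QKNormAttention()(x).shape)  # torch.Size([1, 16, 5120])
```

In general, normalizing queries and keys per head keeps the attention logits in a bounded range regardless of how large the projection outputs grow, which is the intuition behind the stability benefit mentioned above.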
Model Specifications
- Parameters: 14.8 billion (13.2 billion non-embedding)
- Context Length: 32,768 tokens
- Layers: 40
- Attention Heads (GQA): 40 for Q, 8 for KV
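As a quick way to check these specifications and run the model, the snippet below loads Qwen3-14B-Base with Hugging Face transformers and prints the relevant config fields. This is a minimal sketch, assuming a transformers version with Qwen3 support (roughly 4.51 or later) and enough memory for a 14.8B-parameter model; the expected values in the comments simply mirror the list above.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-14B-Base"

# Loading only the config is lightweight and confirms the specifications above.
config = AutoConfig.from_pretrained(model_name)
print(config.num_hidden_layers)         # expected: 40
print(config.num_attention_heads)       # expected: 40 (query heads)
print(config.num_key_value_heads)       # expected: 8  (KV heads, GQA)
print(config.max_position_embeddings)   # expected: 32768

# Full model load for base-model (text completion) inference.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
inputs = tokenizer("The three stages of Qwen3 pre-training are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Here `torch_dtype="auto"` and `device_map="auto"` let transformers pick the checkpoint's native precision and an available device; adjust these to your hardware as needed.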
For detailed evaluation results and further information, refer to the official Qwen3 blog and GitHub repository.