Overview
Qwen3-14B-Base is a 14.8-billion-parameter pre-trained causal language model from the latest generation of the Qwen series. It builds on Qwen2.5 with significant advances in training data, model architecture, and optimization techniques, and supports a context length of 32,768 tokens.
Key Improvements & Capabilities
- Expanded Higher-Quality Pre-training Corpus: Trained on 36 trillion tokens across 119 languages, tripling the language coverage of its predecessor. This corpus includes a rich mix of coding, STEM, reasoning, book, multilingual, and synthetic data.
- Advanced Training Techniques: Incorporates architectural refinements such as QK-LayerNorm, improving training stability and performance across all models in the series.
- Three-stage Pre-training:
  - Stage 1: Focuses on broad language modeling and general knowledge.
  - Stage 2: Enhances reasoning skills, including STEM, coding, and logical reasoning.
  - Stage 3: Improves long-context comprehension by extending training sequence lengths.
- Scaling-Law-Guided Hyperparameter Tuning: Critical hyperparameters were systematically tuned for both dense and MoE models, optimizing training dynamics and final performance.
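To make the QK-LayerNorm refinement above concrete, the sketch below applies an RMS-style normalization to the query and key vectors before the attention dot product, which bounds the attention-logit magnitude. This is a minimal, illustrative NumPy version for a single head; shapes, epsilon, and the exact normalization placement are assumptions, not Qwen3's actual implementation.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Root-mean-square normalization over the last (head) dimension.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """Scaled dot-product attention with RMS normalization on Q and K.

    q, k, v: arrays of shape (seq_len, head_dim). Normalizing Q and K
    keeps the pre-softmax logits bounded, which helps training stability.
    """
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (4, 8)
```

In a real transformer block this normalization is applied per attention head after the Q/K projections; the gain here is that logit scale no longer grows with the norm of the projected activations.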
Model Specifications
- Type: Causal Language Model
- Parameters: 14.8 billion (13.2 billion non-embedding)
- Layers: 40
- Attention Heads (GQA): 40 for Q, 8 for KV
- Context Length: 32,768 tokens
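The specifications above determine the inference-time KV-cache footprint, and the 8 KV heads (vs. 40 query heads) show why GQA matters. A back-of-the-envelope estimate follows; the per-head dimension of 128 is an assumption (it is not listed in the specs above), as is the 2-byte fp16/bf16 cache precision.

```python
# KV-cache memory estimate for Qwen3-14B-Base from the specs above.
layers = 40        # transformer layers
kv_heads = 8       # GQA key/value heads (vs. 40 query heads)
head_dim = 128     # ASSUMED per-head dimension, not stated in the spec list
bytes_per_val = 2  # fp16/bf16 cache precision (assumed)

# K and V are each cached per layer, per KV head, per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
full_context = per_token * 32_768  # full 32,768-token context

print(f"{per_token / 1024:.0f} KiB per token")            # 160 KiB
print(f"{full_context / 2**30:.1f} GiB at full context")  # 5.0 GiB
```

Under these assumptions, full multi-head attention (40 KV heads instead of 8) would need five times as much cache memory, which is the practical motivation for grouped-query attention.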
For detailed evaluation results and further information, refer to the official Qwen3 blog and GitHub repository.