agadelmoula-avey/Qwen3-4B-Base is a 4.0-billion-parameter causal language model from the Qwen3 series, pre-trained by the Qwen Team on an expanded corpus of 36 trillion tokens covering 119 languages. The base model incorporates architectural refinements and a three-stage pre-training process to strengthen broad language modeling, reasoning, and long-context comprehension up to 32,768 tokens. It is intended for general knowledge acquisition and foundational language understanding tasks.
# Qwen3-4B-Base Overview
Qwen3-4B-Base is a 4.0 billion parameter causal language model, part of the Qwen3 series developed by the Qwen Team. This model builds upon significant advancements in training data, architecture, and optimization techniques compared to its predecessor, Qwen2.5.
## Key Capabilities & Features
- Expanded Pre-training Corpus: Trained on an extensive 36 trillion tokens across 119 languages, tripling the language coverage of Qwen2.5. The dataset includes a rich mix of high-quality data for coding, STEM, reasoning, and multilingual tasks.
- Architectural Refinements: Incorporates training techniques and architectural improvements such as `qk layernorm` for enhanced stability and performance.
- Three-stage Pre-training: Utilizes a staged approach focusing on broad language modeling, improving reasoning skills (STEM, coding, logical reasoning), and enhancing long-context comprehension by extending sequence lengths up to 32,768 tokens.
- Scaling Law Guided Tuning: Hyperparameters are systematically tuned across the pre-training pipeline for optimal training dynamics and performance.
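Because this is a base (non-instruct) checkpoint, it is typically used for plain text completion rather than chat. The sketch below assumes the standard Hugging Face `transformers` API with `torch` and `accelerate` installed; the repository id is taken from this card, while the prompt and sampling settings are illustrative choices, not official recommendations.

```python
# Minimal sketch: plain-text completion with a base (non-instruct) model.
# The repo id comes from this card; sampling settings are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agadelmoula-avey/Qwen3-4B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place layers on available GPU(s)/CPU (requires accelerate)
)

# Base models carry no chat template, so prompt with raw text to be continued.
prompt = "The three stages of Qwen3 pre-training are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```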
## Model Specifications
- Parameters: 4.0 billion (3.6 billion non-embedding)
- Context Length: 32,768 tokens
- Layers: 36
- Attention Heads (GQA): 32 for Q, 8 for KV
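These figures can be cross-checked against the checkpoint's configuration without downloading the weights. The sketch below reads `config.json` via `transformers`; the field names follow the standard Hugging Face causal-LM configuration and are assumptions on my part rather than values quoted from this card.

```python
# Sketch: fetch only the model configuration and compare it with the numbers above.
# Field names follow the standard Hugging Face causal-LM config (assumed, not from the card).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("agadelmoula-avey/Qwen3-4B-Base")

print("layers:             ", config.num_hidden_layers)        # expected 36
print("attention heads (Q):", config.num_attention_heads)      # expected 32
print("KV heads (GQA):     ", config.num_key_value_heads)      # expected 8
print("max context length: ", config.max_position_embeddings)  # expected 32768
```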
For detailed evaluation results and further information, refer to the official Qwen3 blog and GitHub repository.