dnotitia/Qwen3-4B-Base
dnotitia/Qwen3-4B-Base is a 4.0 billion parameter causal language model from the Qwen3 series, pre-trained on 36 trillion tokens across 119 languages with a 32,768 token context length. Developed by the Qwen team and patched by dnotitia for improved training compatibility, it incorporates advanced training techniques and architectural refinements such as qk layernorm. This base model is designed for broad language modeling, general knowledge acquisition, and enhanced reasoning skills, making it suitable for efficient training experiments.
Qwen3-4B-Base Overview
dnotitia/Qwen3-4B-Base is a 4.0 billion parameter causal language model, part of the Qwen3 series. This specific version, patched by dnotitia, maintains the original Qwen3 weights but includes a refactored chat template and {% generation %} tags for better compatibility with the trl library's assistant_only_loss feature, making it ideal for efficient training experiments.
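The effect of assistant_only_loss is that loss is computed only on assistant-reply tokens; the {% generation %} tags in the chat template are what let the trainer locate those spans. A minimal sketch of the resulting masking step in plain Python (independent of trl; the function name and token values are illustrative):

```python
# Sketch of assistant-only loss masking: label positions outside the
# assistant spans are set to -100, the index that cross-entropy ignores.
# (Illustrative only; trl derives the mask from {% generation %} tags.)
IGNORE_INDEX = -100

def mask_non_assistant(input_ids, assistant_mask):
    """Return labels where only assistant tokens contribute to the loss."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]

# Example: a 6-token sequence where the last 3 tokens are the assistant reply.
input_ids = [101, 2023, 2003, 78, 912, 340]
assistant_mask = [0, 0, 0, 1, 1, 1]
labels = mask_non_assistant(input_ids, assistant_mask)
# labels -> [-100, -100, -100, 78, 912, 340]
```

Without the {% generation %} markers in the template, the trainer has no reliable way to build this mask, which is why the patched template matters for this workflow.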
Key Qwen3 Highlights
Qwen3 represents the latest generation of Qwen models, built upon significant advancements in training data, model architecture, and optimization. Key improvements over Qwen2.5 include:
- Expanded Pre-training Corpus: Trained on 36 trillion tokens across 119 languages, tripling the language coverage of its predecessor. The dataset is rich in high-quality data, including coding, STEM, reasoning, and multilingual content.
- Advanced Training Techniques: Incorporates architectural refinements such as global-batch load balancing loss for MoE models and qk layernorm for all models, enhancing stability and performance.
- Three-stage Pre-training: A structured approach that first builds broad language modeling and general knowledge, then strengthens reasoning skills (STEM, coding), and finally extends long-context comprehension up to 32k tokens.
- Scaling Law Guided Tuning: Critical hyperparameters were systematically tuned using scaling law studies to optimize training dynamics and performance across different model scales.
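Of the refinements above, qk layernorm is the one applied to all Qwen3 models: query and key vectors are normalized per head before the attention score, which bounds the scale of the attention logits and stabilizes training. A toy sketch of the idea using RMS normalization (illustrative only; real implementations operate on tensors and include a learned scale):

```python
import math

def rms_norm(v, eps=1e-6):
    # RMS-normalize a vector so its root-mean-square is ~1.
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def attention_logit(q, k):
    # qk layernorm step: normalize q and k before the scaled dot product.
    q, k = rms_norm(q), rms_norm(k)
    scale = 1.0 / math.sqrt(len(q))
    return scale * sum(qi * ki for qi, ki in zip(q, k))

# Even with a pathologically large query, the logit stays bounded
# (at most sqrt(d) in magnitude for head dimension d).
logit = attention_logit([1000.0, -2000.0, 500.0, 0.0], [0.1, 0.2, -0.3, 0.4])
```

Without the normalization, a large-magnitude query or key would blow up the logit, pushing the softmax into saturation; normalizing both sides removes that failure mode.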
Model Specifications
- Parameters: 4.0 billion (3.6 billion non-embedding)
- Layers: 36
- Attention Heads (GQA): 32 for Q, 8 for KV
- Context Length: 32,768 tokens
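With grouped-query attention, the 8 KV heads are shared across the 32 query heads, so each KV head serves a group of 4 query heads and the KV cache is a quarter the size of full multi-head attention. A quick check of that arithmetic (the head dimension below is a hypothetical placeholder, not stated in the specs above):

```python
# GQA arithmetic for the specs above: 32 query heads share 8 KV heads.
n_q_heads, n_kv_heads = 32, 8

group_size = n_q_heads // n_kv_heads      # query heads served per KV head
kv_cache_ratio = n_kv_heads / n_q_heads   # KV cache size vs. full MHA

# group_size -> 4, kv_cache_ratio -> 0.25 (a 4x smaller KV cache)

# Per-token KV cache elements per layer: 2 (K and V) * n_kv_heads * head_dim.
head_dim = 128  # hypothetical value; not specified in the card above
per_token_per_layer = 2 * n_kv_heads * head_dim  # -> 2048 elements
```

This 4x cache reduction is what makes the 32,768-token context practical to serve at this parameter scale.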
For detailed evaluation results and further information, refer to the official Qwen3 blog and GitHub repository.