Qwen3-8B-Base Overview

Qwen3-8B-Base is an 8.2-billion-parameter causal language model in the Qwen3 series developed by Qwen. It is the pre-trained base version of the model, and it improves on its predecessor, Qwen2.5, in training data, architecture, and training procedure.

Key Improvements and Features

Expanded Pre-training Corpus: Trained on 36 trillion tokens across 119 languages, three times the language coverage of Qwen2.5. The corpus mixes high-quality data for coding, STEM, reasoning, and multilingual tasks.
Advanced Training Techniques: Incorporates architectural refinements such as QK layernorm, together with a three-stage pre-training process. Stage 1 focuses on general language modeling, Stage 2 strengthens reasoning skills (STEM, coding, logical reasoning), and Stage 3 improves long-context comprehension by extending training sequences up to 32,768 tokens.
Optimized Hyperparameter Tuning: Utilizes scaling law studies to systematically tune hyperparameters, improving training dynamics and performance across different model scales.
Technical Specifications: 36 layers, grouped-query attention with 32 query (Q) heads and 8 key/value (KV) heads, and a context length of 32,768 tokens.
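
To make the attention figures above concrete, the following sketch works out how many query heads share each KV head and roughly how much memory the key/value cache would need at the full context length. The layer count, head counts, and context length come from the specifications above. The per-head dimension (128) and the 16-bit cache precision are assumptions that this overview does not state, so treat the result as a rough estimate rather than an official figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSpec:
    num_layers: int = 36          # from the specifications above
    num_q_heads: int = 32         # from the specifications above
    num_kv_heads: int = 8         # from the specifications above
    context_length: int = 32_768  # from the specifications above
    head_dim: int = 128           # ASSUMPTION: not stated in this overview
    bytes_per_value: int = 2      # ASSUMPTION: 16-bit cache (e.g. bfloat16)


def queries_per_kv_head(spec: AttentionSpec) -> int:
    """Number of query heads that share one key/value head."""
    return spec.num_q_heads // spec.num_kv_heads


def kv_cache_bytes(spec: AttentionSpec, num_tokens: int) -> int:
    """Approximate KV cache size for a single sequence of num_tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = (
        2 * spec.num_layers * spec.num_kv_heads
        * spec.head_dim * spec.bytes_per_value
    )
    return per_token * num_tokens


spec = AttentionSpec()
print(queries_per_kv_head(spec))                              # 4
print(kv_cache_bytes(spec, spec.context_length) / 2**30)      # 4.5 (GiB)
```

Under these assumptions, a single sequence at the full 32,768-token context needs about 4.5 GiB of KV cache. With 32 KV heads instead of 8, the same formula would give four times that amount, which is the practical benefit of sharing each KV head among 4 query heads.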

When to Use This Model

This model suits applications that need a robust, multilingual base model with strong general knowledge and reasoning capabilities, especially where long-context understanding matters. Because it is a pre-trained base model, it is also a strong foundation for further fine-tuning on specific tasks.
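
As an illustration of using the base model directly, the sketch below builds a few-shot prompt and asks the model to continue it as plain text. It assumes the Hugging Face `transformers` library (plus `torch` and `accelerate`) is installed and that the weights are published under the repository name `Qwen/Qwen3-8B-Base`. Neither detail is stated in this overview, and the helper names `build_few_shot_prompt` and `complete` are hypothetical.

```python
MODEL_ID = "Qwen/Qwen3-8B-Base"  # ASSUMPTION: Hugging Face Hub repository name


def build_few_shot_prompt(examples, query):
    """Format (question, answer) pairs plus a final open question.

    A base model continues text, so worked examples in the prompt
    are one common way to steer it toward the desired output format.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


def complete(prompt, max_new_tokens=64):
    """Return the model's continuation of prompt (downloads the weights)."""
    # Imported lazily so the prompt helper works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        device_map="auto",   # requires the accelerate package
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


prompt = build_few_shot_prompt(
    [("What is the capital of France?", "Paris")],
    "What is the capital of Japan?",
)
# Calling complete(prompt) would download the model and generate a continuation.
```

The prompt is passed as raw text with no chat formatting, since the overview describes this model as a pre-trained base version. For instruction-style behavior, fine-tuning as described above is the more natural route.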
