Overview

Qwen3-8B-Base-Math is an 8.2-billion-parameter causal language model from the Qwen3 series, developed by Qwen. It improves on previous generations in three areas: training corpus, model architecture, and optimization techniques. The model was pre-trained on an expanded, higher-quality dataset of 36 trillion tokens covering 119 languages, a substantial increase in linguistic diversity and data volume compared to Qwen2.5.

Key Capabilities & Improvements

Expanded Pre-training Corpus: Utilizes 36 trillion tokens across 119 languages, with a focus on high-quality data including coding, STEM, reasoning, and multilingual content.
Architectural Refinements: Incorporates training and architectural techniques such as a global-batch load-balancing loss (relevant to the Mixture-of-Experts models in the Qwen3 series) and QK layernorm, aimed at improved stability and performance.
Three-stage Pre-training: A structured approach that first establishes general knowledge, then enhances reasoning skills (STEM, coding, logical reasoning), and finally extends long-context comprehension up to 32,768 tokens.
Scaling Law Guided Tuning: Hyperparameters are tuned systematically, guided by scaling-law studies, across the three-stage pre-training pipeline to improve training dynamics and final performance.
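
To make the QK layernorm item above more concrete, here is a minimal pure-Python sketch of the general idea: normalize the query and key vectors of an attention head before taking their scaled dot product, so the attention logit stays bounded even when the raw activations are large. This is an illustration under stated assumptions, not the Qwen3 implementation. It assumes an RMS-style normalization with no learned gain, and the names rms_norm and attention_logit are hypothetical.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Rescale vec so its root-mean-square is about 1 (learned gain omitted; an assumption)."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in vec]


def attention_logit(q, k, use_qk_norm=True):
    """Scaled dot-product logit for one query/key pair within a single head."""
    if use_qk_norm:
        # Normalize both sides before the dot product (the "QK norm" step).
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Deliberately large activations to show the effect of normalization.
q = [100.0, -250.0, 75.0, 300.0]
k = [120.0, -200.0, 90.0, 310.0]

raw = attention_logit(q, k, use_qk_norm=False)
normed = attention_logit(q, k, use_qk_norm=True)
print(f"logit without QK norm: {raw:.1f}")
print(f"logit with QK norm:    {normed:.3f}")
```

With both vectors normalized to unit RMS, the logit magnitude cannot exceed the square root of the head dimension, which keeps the softmax input in a predictable range regardless of activation scale.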

Model Specifications

Type: Causal Language Model
Parameters: 8.2 billion (6.95 billion non-embedding)
Context Length: 32,768 tokens
Layers: 36
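
The figures above imply a few derived quantities that are handy when planning usage. The sketch below records the listed specifications and works out the approximate embedding parameter count (total minus non-embedding) and whether a prompt plus a generation budget fits in the context window. The ModelSpecs class and its method names are hypothetical helpers for illustration, the parameter counts are the rounded values from this card, and the check assumes that prompt and generated tokens share the same 32,768-token window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpecs:
    """Specifications as listed in this card (parameter counts are rounded)."""
    total_params: int
    non_embedding_params: int
    context_length: int
    num_layers: int

    @property
    def embedding_params(self) -> int:
        # Whatever is not counted as non-embedding is attributed to embeddings.
        return self.total_params - self.non_embedding_params

    def fits_in_context(self, prompt_tokens: int, max_new_tokens: int) -> bool:
        # Assumption: prompt and generated tokens share one context window.
        return prompt_tokens + max_new_tokens <= self.context_length


SPECS = ModelSpecs(
    total_params=8_200_000_000,
    non_embedding_params=6_950_000_000,
    context_length=32_768,
    num_layers=36,
)

embedding_share = SPECS.embedding_params / SPECS.total_params
print(f"approx. embedding parameters: {SPECS.embedding_params:,}")
print(f"approx. share of total: {embedding_share:.1%}")
print("30,000-token prompt + 2,768 new tokens fits:",
      SPECS.fits_in_context(30_000, 2_768))
```

Because the card reports rounded counts, the derived embedding figure of roughly 1.25 billion should be read as an estimate rather than an exact value.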

Further Information

For detailed evaluation results and technical insights, refer to the official Qwen3 blog and GitHub repository.
