unsloth/Qwen3-8B-Base

8B parameters · FP8 · 32,768-token context · License: apache-2.0

Qwen3-8B-Base Overview

Qwen3-8B-Base is an 8.2-billion-parameter causal language model from the Qwen3 series, developed by the Qwen team. It is a pre-trained base model (no instruction tuning or post-training) that incorporates significant advances over its predecessor, Qwen2.5, with a focus on higher data quality, architectural improvements, and optimized training methodology.
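
For orientation, here is a minimal loading sketch using the Hugging Face transformers library. It assumes a transformers release recent enough to include Qwen3 support; the dtype and device-placement settings are illustrative defaults, not requirements.

```python
# Minimal loading sketch, assuming a recent transformers release with Qwen3
# support and enough GPU memory for ~8B parameters in 16-bit precision.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype recorded in the checkpoint config
    device_map="auto",    # place weights on available devices automatically
)
```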

Key Capabilities & Features

  • Expanded Pre-training Corpus: Trained on roughly 36 trillion tokens covering 119 languages, substantially broadening language coverage and raising data quality, with dedicated coding, STEM, reasoning, and multilingual data.
  • Architectural Refinements: Integrates advanced training techniques and architectural improvements such as QK-Norm, which layer-normalizes the attention queries and keys for improved training stability (see the sketch after this list).
  • Three-stage Pre-training: Follows a three-stage curriculum: broad language modeling first, then data that strengthens reasoning skills (STEM, coding), and finally long-context training that extends comprehension to 32,768 tokens.
  • Optimized Hyperparameter Tuning: Benefits from comprehensive scaling law studies to systematically tune hyperparameters for better training dynamics and performance across different model scales.
  • Technical Specifications: Features 8.2 billion parameters (6.95B non-embedding), 36 layers, and a context length of 32,768 tokens.
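
To make the QK-Norm point above concrete, the sketch below shows the idea in isolation: each attention head's query and key vectors are RMS-normalized (with a learned gain) before the dot product, which keeps attention logits bounded. This is an illustration of the general technique, not Qwen's actual implementation; all shapes and names here are invented for the example.

```python
# Illustration of QK-Norm: queries and keys are RMS-normalized per head
# before computing attention logits, stabilizing training.
import torch

def rms_norm(x, weight, eps=1e-6):
    # Normalize over the head dimension, then apply a learned per-channel gain.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

head_dim = 128
q = torch.randn(2, 8, 16, head_dim)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 16, head_dim)
q_gain = torch.ones(head_dim)        # learned parameters in a real model
k_gain = torch.ones(head_dim)

q, k = rms_norm(q, q_gain), rms_norm(k, k_gain)
attn_logits = q @ k.transpose(-2, -1) / head_dim**0.5  # bounded logits
```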

Good for

  • Applications requiring strong multilingual understanding and generation.
  • Tasks demanding advanced reasoning, STEM problem-solving, and code generation (see the completion sketch after this list).
  • Use cases benefiting from long-context processing and comprehension.
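
Because this is a base (pre-trained only) model, it is best prompted for raw text continuation rather than chat. A minimal completion sketch follows; the prompt and decoding settings are illustrative, not recommendations.

```python
# Greedy text-completion sketch. As a base model there is no chat template,
# so the prompt is simply continued as plain text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```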

For detailed evaluation results and further information, refer to the official Qwen3 blog and GitHub repository.