unsloth/Qwen3-0.6B-Base

0.8B params · BF16 · 40,960 context · License: apache-2.0

Qwen3-0.6B-Base Overview

Qwen3-0.6B-Base is a 0.6-billion-parameter pre-trained causal language model in the latest Qwen3 series. Developed by the Qwen team, it incorporates significant advancements in training data, model architecture, and optimization techniques over its predecessor, Qwen2.5, and supports a native context length of 32,768 tokens.

Key Capabilities & Improvements

  • Expanded Pre-training Corpus: Trained on 36 trillion tokens across 119 languages, tripling the language coverage of Qwen2.5. The dataset includes a richer mix of high-quality data, such as coding, STEM, reasoning, and multilingual content.
  • Architectural Refinements: Incorporates advanced training techniques and architectural improvements, including QK-LayerNorm in all models, improving training stability and overall performance (an illustrative sketch follows this list).
  • Three-stage Pre-training: Pre-training proceeds in three stages: broad language modeling first, then a stage emphasizing reasoning-heavy data (STEM, coding), and finally a long-context stage that extends sequence lengths up to 32k tokens.
  • Scaling Law Guided Tuning: Critical hyperparameters were systematically tuned using comprehensive scaling law studies across the pre-training pipeline, optimizing training dynamics and performance.
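
The QK-LayerNorm mentioned above normalizes the query and key projections per attention head before the softmax. Below is a minimal PyTorch sketch of the idea; the class name, dimensions, and the use of RMSNorm over the head dimension are illustrative assumptions, not the exact Qwen3 implementation.

```python
# Minimal sketch of per-head QK normalization (PyTorch >= 2.4 for nn.RMSNorm).
# hidden_size, num_q_heads, num_kv_heads, and head_dim are illustrative values.
import torch
import torch.nn as nn

class QKNormProjection(nn.Module):
    def __init__(self, hidden_size: int = 1024, num_q_heads: int = 16,
                 num_kv_heads: int = 8, head_dim: int = 64):
        super().__init__()
        self.num_q_heads, self.num_kv_heads, self.head_dim = num_q_heads, num_kv_heads, head_dim
        self.q_proj = nn.Linear(hidden_size, num_q_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)
        # RMSNorm over each head's feature dimension keeps query/key magnitudes
        # bounded, which stabilizes attention logits during training.
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x: torch.Tensor):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_q_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim)
        # Normalize each head's query and key vectors independently,
        # then hand them off to rotary embedding / attention as usual.
        return self.q_norm(q), self.k_norm(k)
```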

Model Specifications

  • Type: Causal Language Model
  • Training Stage: Pretraining
  • Parameters: 0.6 Billion (0.44 Billion non-embedding)
  • Layers: 28
  • Attention Heads (GQA): 16 for Q, 8 for KV
  • Context Length: 32,768 tokens
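
The configuration values listed above can be checked, and the model exercised, with a short Hugging Face transformers snippet. This is a minimal sketch assuming the unsloth/Qwen3-0.6B-Base repository id and an environment with transformers (plus accelerate for device_map="auto") installed; the prompt is only an example.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Qwen3-0.6B-Base"

# Inspect the architecture fields the specification list above refers to.
config = AutoConfig.from_pretrained(model_id)
print(config.num_hidden_layers, config.num_attention_heads, config.num_key_value_heads)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Plain next-token continuation: this is a base (pre-trained) checkpoint,
# so prompt it as raw text rather than with a chat template.
inputs = tokenizer("The three stages of Qwen3 pre-training are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model rather than an instruction-tuned one, it is best suited to continuation-style prompting and as a starting point for fine-tuning.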

For detailed evaluation results and further information, refer to the Qwen3 blog and GitHub repository.