CCCCCyx/Qwen3-8B-onpolicy-profiling-adam-20260403_091551

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Apr 29, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Qwen3-8B-Base is an 8.2 billion parameter causal language model developed by Qwen and pre-trained on 36 trillion tokens across 119 languages. The model incorporates architectural refinements such as QK layer normalization and a three-stage pre-training process that strengthens reasoning, coding, and long-context comprehension up to 32,768 tokens. It is designed for broad language modeling and general knowledge acquisition, with a focus on improved training stability and performance.


Qwen3-8B-Base: An Overview

Qwen3-8B-Base is a pre-trained causal language model from the Qwen series, featuring 8.2 billion parameters and a context length of 32,768 tokens. Developed by Qwen, this model represents the latest generation, building upon advancements in training data, architecture, and optimization techniques.
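This card does not include a quickstart, so here is a minimal loading sketch using the Hugging Face transformers API. The upstream repo id `Qwen/Qwen3-8B-Base`, the dtype/device settings, and the prompt are assumptions for illustration, not taken from this page.

```python
# Minimal text-completion sketch for a base (non-chat) model with transformers.
# Assumes the upstream repo id "Qwen/Qwen3-8B-Base"; adjust to the checkpoint you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # keep the checkpoint's native precision
    device_map="auto",   # place weights on available GPU(s)
)

# Base models do plain continuation, not chat; prompt accordingly.
prompt = "The three stages of Qwen3 pre-training are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```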

Key Enhancements and Capabilities

Qwen3-8B-Base distinguishes itself through several key improvements over previous iterations:

  • Expanded High-Quality Pre-training Corpus: Trained on an extensive 36 trillion tokens covering 119 languages, significantly broadening its linguistic and knowledge base. The corpus includes a rich mix of coding, STEM, reasoning, book, multilingual, and synthetic data.
  • Advanced Training Techniques: Incorporates architectural refinements such as QK layer normalization (sketched after this list) and, for the MoE variants in the series, a global-batch load-balancing loss, contributing to enhanced training stability and overall performance.
  • Three-Stage Pre-training: Utilizes a structured pre-training approach:
    • Stage 1: Focuses on general language modeling and knowledge acquisition.
    • Stage 2: Improves specialized reasoning skills, including STEM, coding, and logical reasoning.
    • Stage 3: Enhances long-context comprehension by extending training sequence lengths up to 32k tokens.
  • Scaling Law Guided Hyperparameter Tuning: Critical hyperparameters were systematically tuned across the three-stage pipeline for optimal training dynamics and performance.
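To make the QK layer normalization point concrete, here is an illustrative sketch of per-head query/key normalization applied before the attention scores. The class, the RMSNorm choice, and the dimensions are assumptions for illustration (Qwen's actual implementation may differ); `nn.RMSNorm` requires PyTorch 2.4+.

```python
# Illustrative sketch of QK normalization in attention (not Qwen's actual code).
# RMSNorm is applied per head to queries and keys before the score computation,
# which bounds the attention logit scale and stabilizes training.
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, hidden_size=4096, num_heads=32, head_dim=128):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
        # Normalization over the per-head feature dimension ("QK norm").
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize before scores
        scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / self.head_dim**0.5
        # Values and the output projection are omitted for brevity;
        # this returns only the attention weights.
        return scores.softmax(dim=-1)
```

Normalizing q and k per head keeps the attention logits bounded regardless of how large the projection outputs grow, which is the stability benefit the bullet above refers to.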

Model Specifications

  • Type: Causal Language Model
  • Training Stage: Pretraining
  • Parameters: 8.2 Billion (6.95 Billion non-embedding)
  • Layers: 36
  • Attention Heads (GQA): 32 for Q, 8 for KV (see the cache-size sketch after this list)
  • Context Length: 32,768 tokens
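As a worked example of what these specs imply, the sketch below estimates KV-cache memory for the GQA layout above. The 128-dim head size and bf16 cache storage are assumptions not listed on this card.

```python
# Back-of-the-envelope KV-cache size for the specs listed above.
# Assumptions (not stated on this card): head_dim = 128, cache stored in bf16.
layers = 36         # from the spec list
kv_heads = 8        # GQA: 8 KV heads shared across 32 query heads
head_dim = 128      # assumed per-head dimension
bytes_per_elem = 2  # bf16

# K and V each store layers * kv_heads * head_dim values per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {per_token / 1024:.0f} KiB")         # ~144 KiB

ctx = 32_768
print(f"Full 32k context:   {per_token * ctx / 2**30:.1f} GiB")  # ~4.5 GiB
```

Because only the 8 KV heads are cached rather than all 32 query heads, GQA shrinks this cache to a quarter of the equivalent multi-head size.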

For detailed evaluation results and further technical information, refer to the official Qwen3 blog and GitHub repository.