CCCCCyx/Qwen3-8B-onpolicy-profiling-gasd-20260425_153824

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: Apr 29, 2026 · License: apache-2.0 · Architecture: Transformer (open weights)

The CCCCCyx/Qwen3-8B-onpolicy-profiling-gasd-20260425_153824 model is an 8.2-billion-parameter causal language model from the Qwen3 series, developed by Qwen. It features a 32,768-token context length and is pre-trained on an expanded corpus of 36 trillion tokens spanning 119 languages, with a focus on high-quality data in coding, STEM, and reasoning. The model incorporates architectural refinements such as qk layernorm and uses a three-stage pre-training process to strengthen general knowledge, reasoning skills, and long-context comprehension.


Qwen3-8B-Base Overview

This model, part of the Qwen3 series by Qwen, is an 8.2-billion-parameter causal language model supporting a 32,768-token context length. It advances over previous Qwen generations by training on an expanded, higher-quality pre-training corpus: 36 trillion tokens across 119 languages, which significantly widens multilingual coverage and adds a richer mix of specialized data such as coding, STEM, reasoning, and synthetic content.
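As a quick orientation, the sketch below shows minimal plain-text completion with the Hugging Face `transformers` library. The repo id is taken from this page; `device_map="auto"` assumes `accelerate` is installed. This is an illustrative sketch, not an official quick-start from the model authors.

```python
# Minimal generation sketch using Hugging Face transformers (illustrative;
# exact loading options may vary with your transformers version).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CCCCCyx/Qwen3-8B-onpolicy-profiling-gasd-20260425_153824"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Base models are not chat-tuned, so prompt with plain text completion.
prompt = "The three stages of Qwen3 pre-training are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```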

Key Capabilities & Features

  • Expanded Pre-training Corpus: Utilizes 36 trillion tokens across 119 languages, tripling language coverage and enhancing data quality for diverse tasks.
  • Architectural & Training Refinements: Incorporates global-batch load balancing for MoE variants and qk layernorm for all models, improving training stability and performance (see the sketch after this list).
  • Three-stage Pre-training: A structured approach that first builds general language modeling and knowledge, then refines reasoning skills (STEM, coding), and finally extends long-context comprehension up to 32k tokens.
  • Scaling Law Guided Tuning: Hyperparameters are systematically tuned for both dense and MoE models across the pre-training stages, optimizing training dynamics and final performance.
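To give a rough picture of the qk layernorm refinement mentioned above, the sketch below applies RMSNorm to the per-head query and key vectors before the attention dot product, which keeps attention logits well-scaled during training. Layer sizes, the module name, and the exact norm placement are illustrative assumptions (and `nn.RMSNorm` requires PyTorch ≥ 2.4), not the model's actual implementation.

```python
# Hypothetical minimal sketch of qk layernorm in a Qwen3-style attention block.
import torch
import torch.nn as nn

class QKNormAttentionProj(nn.Module):
    def __init__(self, hidden: int = 4096, n_heads: int = 32):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = hidden // n_heads
        self.q_proj = nn.Linear(hidden, hidden, bias=False)
        self.k_proj = nn.Linear(hidden, hidden, bias=False)
        # RMSNorm over the per-head dimension, applied to q and k separately.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim)
        # qk layernorm: normalize queries and keys before computing scores.
        q, k = self.q_norm(q), self.k_norm(k)
        # Scaled attention logits, shape (batch, heads, query_len, key_len).
        return torch.einsum("bqhd,bkhd->bhqk", q, k) / self.head_dim**0.5
```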

Good For

  • Applications requiring strong multilingual capabilities across 119 languages.
  • Tasks benefiting from enhanced reasoning, STEM, and coding understanding.
  • Use cases demanding long-context comprehension up to 32,768 tokens.
  • Developers seeking a robust base model for further fine-tuning on specialized tasks (a LoRA-style sketch follows below).
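Since the base model is positioned for downstream fine-tuning, here is a hedged sketch of attaching LoRA adapters with the `peft` library. The rank, alpha, and target module names are illustrative assumptions, not settings recommended by the model authors.

```python
# Illustrative LoRA fine-tuning setup sketch (hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "CCCCCyx/Qwen3-8B-onpolicy-profiling-gasd-20260425_153824",
    torch_dtype="auto",
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, the wrapped model can be passed to a standard training loop or trainer; only the adapter parameters receive gradients, which keeps memory requirements well below full fine-tuning of all 8.2B weights.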