CCCCCyx/Qwen3-8B-onpolicy-profiling-gasd-20260425_153824
The CCCCCyx/Qwen3-8B-onpolicy-profiling-gasd-20260425_153824 model is an 8.2-billion-parameter causal language model from the Qwen3 series, developed by Qwen. It features a 32,768-token context length and is pre-trained on an expanded corpus of 36 trillion tokens spanning 119 languages, with a focus on high-quality data including coding, STEM, and reasoning content. The model incorporates architectural refinements such as QK layernorm and a three-stage pre-training process to enhance general knowledge, reasoning skills, and long-context comprehension.
Qwen3-8B-Base Overview
This model, part of the Qwen3 series by Qwen, is an 8.2-billion-parameter causal language model pre-trained with a 32,768-token context length. It advances over previous Qwen iterations through an expanded, higher-quality pre-training corpus: 36 trillion tokens across 119 languages, significantly increasing multilingual coverage and including a richer mix of specialized data such as coding, STEM, reasoning, and synthetic content.
Key Capabilities & Features
- Expanded Pre-training Corpus: Utilizes 36 trillion tokens across 119 languages, tripling language coverage and enhancing data quality for diverse tasks.
- Architectural Refinements: Incorporates training techniques such as global-batch load balancing for the series' MoE models and QK layernorm for all models, improving stability and performance.
- Three-stage Pre-training: A structured approach that first builds general language modeling and knowledge, then refines reasoning skills (STEM, coding), and finally extends long-context comprehension up to 32k tokens.
- Scaling Law Guided Tuning: Hyperparameters are systematically tuned for both dense and MoE models across the pre-training stages, optimizing training dynamics and final performance.
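The QK layernorm mentioned above normalizes query and key vectors before the attention dot product, which bounds the scale of attention logits and stabilizes training. A minimal NumPy sketch of the idea (the RMS-style normalization, shapes, and gain parameters here are illustrative assumptions, not the model's actual code):

```python
import numpy as np

def rms_norm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Normalize over the last (per-head) dimension, then scale by a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def qk_layernorm(q: np.ndarray, k: np.ndarray,
                 gamma_q: np.ndarray, gamma_k: np.ndarray):
    """Normalize query and key head vectors before the attention dot product,
    keeping the logits' magnitude roughly constant regardless of scale drift."""
    return rms_norm(q, gamma_q), rms_norm(k, gamma_k)
```

In effect, each head's query and key are rescaled to unit RMS before scores are computed, so no single head can blow up the attention logits during training.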
Good For
- Applications requiring strong multilingual capabilities across 119 languages.
- Tasks benefiting from enhanced reasoning, STEM, and coding understanding.
- Use cases demanding long-context comprehension up to 32,768 tokens.
- Developers seeking a robust base model for further fine-tuning on specialized tasks.
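For the long-context use cases above, a trivial sketch of budgeting prompt and generation lengths against the 32,768-token window (the function and names are illustrative, not an API of this model):

```python
QWEN3_CONTEXT_LEN = 32_768  # pre-trained context length of Qwen3-8B-Base

def generation_budget(prompt_tokens: int, limit: int = QWEN3_CONTEXT_LEN) -> int:
    """How many new tokens can still fit once the prompt occupies the window."""
    if prompt_tokens >= limit:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens exceeds the {limit}-token window"
        )
    return limit - prompt_tokens
```

Anything beyond this budget must be truncated, chunked, or summarized before being fed to the model.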