mihai-777/evolai-tfm-1p5b-04
Qwen3-1.7B-Base is a 1.7-billion-parameter causal language model developed by Qwen as part of the Qwen3 series. It is pre-trained on 36 trillion tokens across 119 languages, using an expanded, higher-quality corpus and architectural refinements such as qk layernorm. Pre-training runs in three stages, the last of which extends long-context comprehension to 32k tokens. The model is designed for broad language modeling and general knowledge acquisition.
Qwen3-1.7B-Base Overview
Qwen3-1.7B-Base is a 1.7-billion-parameter causal language model from the Qwen3 series, developed by Qwen. It builds on previous generations with changes to the training data, model architecture, and optimization techniques. The model has 28 layers and uses grouped-query attention with 16 query (Q) heads and 8 key/value (KV) heads, and it supports a context length of 32,768 tokens.
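To make the head counts above concrete, here is a rough sketch of how the number of KV heads affects key/value cache size at full context length. The layer count, head counts, and context length come from the overview; `HEAD_DIM = 128` and 2 bytes per value (bf16) are assumptions not stated on this page, and `kv_cache_bytes` is a hypothetical helper, so treat the result as an estimate only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    Each layer stores one K and one V tensor (hence the factor 2),
    each of shape [num_kv_heads, seq_len, head_dim].
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Values taken from the overview above.
NUM_LAYERS = 28
NUM_Q_HEADS = 16
NUM_KV_HEADS = 8
CONTEXT_LEN = 32_768

# Assumption: per-head dimension is not given on this page.
HEAD_DIM = 128

# Cache size with 8 shared KV heads, as described above.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_LEN)

# Hypothetical comparison: one KV head per query head (16 KV heads).
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM, CONTEXT_LEN)

# Number of query heads that share each KV head.
queries_per_kv_head = NUM_Q_HEADS // NUM_KV_HEADS

print(f"8 KV heads:  {gqa_bytes / 2**30:.2f} GiB")
print(f"16 KV heads: {mha_bytes / 2**30:.2f} GiB")
print(f"Query heads per KV head: {queries_per_kv_head}")
```

Under these assumptions, sharing each KV head between two query heads halves the cache relative to a layout with one KV head per query head, which matters most at the 32,768-token end of the context window.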
Key Improvements & Features
- Expanded Pre-training Corpus: Trained on an extensive 36 trillion tokens across 119 languages, tripling the language coverage of its predecessor, Qwen2.5. This corpus includes a rich mix of high-quality data for coding, STEM, reasoning, and multilingual tasks.
- Architectural Refinements: The Qwen3 series adds a global-batch load-balancing loss for its MoE models and qk layernorm for all models, both aimed at improving training stability and performance. Qwen3-1.7B-Base is a dense model, so only qk layernorm applies to it.
- Three-stage Pre-training: A structured approach where Stage 1 focuses on general language modeling, Stage 2 improves reasoning skills (STEM, coding), and Stage 3 extends long-context comprehension up to 32k tokens.
- Scaling Law Guided Tuning: Hyperparameters are systematically tuned for dense and MoE models through comprehensive scaling law studies, optimizing training dynamics and final performance.
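As an illustration of the qk layernorm idea mentioned in the list above, the following is a minimal sketch rather than the model's actual implementation. It assumes an RMS-style normalization with no learned scale, applied to a single query vector and a single key vector before their dot product; `rms_norm` and `attention_logit` are hypothetical helper names, and the input values are made up for the example.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Rescale a vector so its root-mean-square is approximately 1.

    Assumption: no learned scale parameter, unlike a trained model layer.
    """
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in vec]


def attention_logit(q, k, use_qk_norm):
    """Scaled dot-product logit for one query/key pair."""
    if use_qk_norm:
        # Normalize query and key before the dot product.
        q, k = rms_norm(q), rms_norm(k)
    dot = sum(a * b for a, b in zip(q, k))
    return dot / math.sqrt(len(q))


# Made-up vectors with large magnitudes, as might occur if activations drift.
q = [100.0, -250.0, 300.0, 50.0]
k = [120.0, -200.0, 310.0, 40.0]

raw_logit = attention_logit(q, k, use_qk_norm=False)
normed_logit = attention_logit(q, k, use_qk_norm=True)

print(f"without qk norm: {raw_logit:.1f}")
print(f"with qk norm:    {normed_logit:.3f}")
```

With normalization, the logit magnitude in this sketch cannot exceed the square root of the vector length no matter how large the inputs grow, which is one intuition for why normalizing queries and keys can help keep the attention softmax numerically well behaved.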
Use Cases
This base model is suitable for applications that require broad language understanding and general knowledge, and for tasks that benefit from its multilingual training and long-context capabilities. Because it is a base model rather than an instruction-tuned one, it is typically used as a starting point for fine-tuning or with few-shot prompting in downstream applications.