Qwen/Qwen3-14B-Base
Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 14B · Quant: FP8 · Context Length: 32k · Published: Apr 28, 2025 · License: apache-2.0 · Architecture: Transformer · Open Weights

Qwen3-14B-Base is a 14.8 billion parameter causal language model developed by Qwen, pre-trained on 36 trillion tokens across 119 languages. The model uses a three-stage pre-training process covering broad language modeling, reasoning skills (STEM, coding, logical reasoning), and long-context comprehension up to 32k tokens. It also incorporates refinements such as QK layer normalization (qk layernorm) and scaling-law-guided hyperparameter tuning for improved training stability and performance. Qwen3-14B-Base is designed for general knowledge acquisition and advanced reasoning tasks.


Qwen3-14B-Base Overview

Qwen3-14B-Base is a 14.8 billion parameter pre-trained causal language model from the Qwen series, building upon advancements in training data, architecture, and optimization. Its pre-training corpus has been substantially expanded to 36 trillion tokens across 119 languages, tripling the language coverage of its predecessor, Qwen2.5. The dataset includes a rich mix of high-quality data, such as coding, STEM, reasoning, and multilingual content.

Key Improvements & Features

  • Expanded Pre-training Corpus: Trained on 36 trillion tokens across 119 languages, with a focus on high-quality data for coding, STEM, and reasoning.
  • Architectural Refinements: Incorporates training techniques used across the Qwen3 series, such as a global-batch load-balancing loss for the MoE variants and QK layer normalization for all models (sketched after this list), enhancing training stability and performance.
  • Three-stage Pre-training: A structured approach that first builds general language modeling, then refines reasoning skills (STEM, coding, logical reasoning), and finally extends long-context comprehension up to 32,768 tokens.
  • Scaling Law Guided Tuning: Hyperparameters are systematically tuned using scaling law studies across the pre-training pipeline for optimal training dynamics and performance.
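
As a rough illustration of the QK layer normalization mentioned above, the sketch below applies an RMS-style normalization to the query and key tensors before the attention dot-product, combined with grouped-query attention. It is a simplified conceptual example, not the official Qwen3 implementation: the learnable normalization gains and rotary position embeddings are omitted, and the head dimension of 128 is an assumption.

```python
# Conceptual sketch of "qk layernorm" attention (not the official implementation).
# Normalizing queries and keys per head bounds attention-logit magnitudes,
# which helps training stability.
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps=1e-6):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    # RMS-normalize queries and keys along the head dimension (learnable gain omitted).
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)
    # Grouped-query attention: each KV head serves n_q_heads // n_kv_heads query heads.
    repeat = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(repeat, dim=1)
    v = v.repeat_interleave(repeat, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 40, 16, 128)   # 40 query heads (per the specs below)
k = torch.randn(1, 8, 16, 128)    # 8 key/value heads
v = torch.randn(1, 8, 16, 128)
out = qk_norm_attention(q, k, v)  # shape (1, 40, 16, 128)
```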

Model Specifications

  • Parameters: 14.8 billion (13.2 billion non-embedding)
  • Context Length: 32,768 tokens
  • Layers: 40
  • Attention Heads (GQA): 40 query heads, 8 key/value heads (see the KV-cache estimate below)
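
For a sense of memory footprint at the full context window, here is a back-of-the-envelope KV-cache estimate based on the figures above; the head dimension of 128 and the one-byte (FP8) cache datatype are assumptions rather than published specifications.

```python
# KV-cache size per sequence at the full 32k context, using the GQA specs above.
layers, kv_heads, head_dim = 40, 8, 128   # head_dim = 128 is an assumption
ctx, bytes_per_value = 32_768, 1          # 1 byte/value assumes an FP8 cache
kv_cache_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_per_value  # 2 = keys + values
print(f"{kv_cache_bytes / 2**30:.2f} GiB per sequence")  # 2.50 GiB
```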

For detailed evaluation results and further information, refer to the official Qwen3 blog and GitHub repository.
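
A minimal loading sketch with Hugging Face transformers is shown below. Since this is the pre-trained base model rather than an instruction-tuned variant, it is prompted with plain-text continuations instead of a chat template; the prompt string is just an example.

```python
# Minimal sketch: load Qwen/Qwen3-14B-Base with transformers and generate a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-14B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Base models are prompted with plain text, not a chat template.
inputs = tokenizer("The three stages of Qwen3 pre-training are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```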

Popular Sampler Settings

The top three parameter combinations used by Featherless users for this model cover the following sampler settings: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p. The specific values are shown in the interactive configuration tabs on the model page.
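
For reference, the sketch below shows how sampler settings of this kind map onto transformers generation arguments. The values are placeholders, not the actual Featherless user configurations; frequency_penalty and presence_penalty are OpenAI-style API fields with no direct transformers equivalent (repetition_penalty is the closest native knob), and min_p requires a recent transformers release.

```python
# Placeholder sampler settings expressed as a transformers GenerationConfig.
# These values are illustrative only, not the configs tracked by Featherless.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,          # placeholder value
    top_p=0.9,                # placeholder value
    top_k=40,                 # placeholder value
    repetition_penalty=1.05,  # placeholder value
    min_p=0.05,               # placeholder value; needs a recent transformers version
)
# outputs = model.generate(**inputs, generation_config=gen_config)
```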