unsloth/Qwen3-14B-Base
Hugging Face · Text Generation
Model Size: 14B · Quant: FP8 · Context Length: 32k · Published: Apr 28, 2025 · License: apache-2.0 · Architecture: Transformer

Qwen3-14B-Base is a 14.8 billion parameter causal language model from the Qwen team's Qwen3 series. Pre-trained on 36 trillion tokens across 119 languages, it draws on an expanded, high-quality corpus and architectural refinements such as QK-LayerNorm. As a base model it offers strong general language modeling, knowledge acquisition, and reasoning, with a 32,768-token context length.


Qwen3-14B-Base Overview

Qwen3-14B-Base is a 14.8 billion parameter causal language model in the Qwen3 series. Developed by the Qwen team, it builds on advances in training data, architecture, and optimization techniques, offering significant improvements over its predecessors. A 32,768-token context length enables robust long-context comprehension.

Key Capabilities and Improvements

  • Expanded Pre-training Corpus: Trained on an extensive 36 trillion tokens covering 119 languages, tripling the language coverage of Qwen2.5. The corpus includes a rich mix of high-quality data such as coding, STEM, reasoning, and multilingual content.
  • Architectural Refinements: Incorporates advanced training techniques and architectural improvements, including QK-LayerNorm, enhancing training stability and overall performance.
  • Three-stage Pre-training: Employs a structured pre-training approach:
    • Stage 1: Focuses on broad language modeling and general knowledge.
    • Stage 2: Improves reasoning skills, including STEM, coding, and logical reasoning.
    • Stage 3: Extends training sequence lengths up to 32k tokens for enhanced long-context comprehension.
  • Optimized Hyperparameter Tuning: Utilizes scaling law studies to systematically tune critical hyperparameters for improved training dynamics and performance across different model scales.
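The scaling-law studies mentioned above boil down to fitting power laws (e.g. loss versus model size) and extrapolating to pick hyperparameters. A toy sketch of such a fit; the data points and the functional form L(N) = a·N^(−b) are illustrative, not Qwen's actual study:

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares in log-log space.

    Both input lists are assumed positive. Returns (a, b).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope of log(y) vs log(x); the power-law exponent is its negation.
    slope = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
            sum((u - mx) ** 2 for u in lx)
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

# Toy "loss vs. parameter count" points generated from L(N) = 5 * N**-0.1
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [5 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(a, b)  # recovers a ≈ 5, b ≈ 0.1
```

In practice such fits are run over many training runs at small scales, and the fitted curve guides choices like learning rate and batch size at the target scale.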

Model Specifications

  • Type: Causal Language Model
  • Training Stage: Pretraining
  • Parameters: 14.8 billion (13.2 billion non-embedding)
  • Layers: 40
  • Attention Heads (GQA): 40 for Q, 8 for KV
  • Context Length: 32,768 tokens
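The GQA figures above translate directly into KV-cache savings at inference time: only the 8 KV heads per layer are cached, not all 40 query heads. A minimal back-of-the-envelope sketch, assuming a head dimension of 128 and bf16 (2-byte) cache entries (neither is stated in this card):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes for the K and V caches across all layers at a given sequence length."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

LAYERS, Q_HEADS, KV_HEADS = 40, 40, 8  # from the spec list above
HEAD_DIM = 128                         # assumption: not listed in this card
CTX = 32_768

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX)
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CTX)  # hypothetical full multi-head cache
print(f"GQA KV cache at 32k ctx: {gqa / 2**30:.1f} GiB")  # 5.0 GiB
print(f"Full MHA would need:     {mha / 2**30:.1f} GiB")  # 25.0 GiB
```

Under these assumptions, grouping 40 query heads over 8 KV heads shrinks the full-context KV cache by 5x, which is much of what makes the 32k context practical on a single accelerator.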

For detailed evaluation results and further information, refer to the Qwen3 blog and GitHub repository.