deepseek-ai/DeepSeek-V4-Flash

Hugging Face
Text generation · Concurrency cost: 4 · Model size: 158B · Quant: FP8 · Context length: 32k · Published: Apr 22, 2026 · License: MIT · Architecture: Transformer · Open weights

DeepSeek-AI's DeepSeek-V4-Flash is a Mixture-of-Experts (MoE) language model in the DeepSeek-V4 series, with 284 billion total parameters of which 13 billion are activated per token. It supports an extensive one-million-token context length, leveraging a hybrid attention architecture for improved long-context efficiency. The model is designed for highly efficient long-context intelligence and offers strong reasoning capabilities, especially in its 'Think Max' mode.


DeepSeek-V4-Flash: Efficient Million-Token Context Intelligence

DeepSeek-V4-Flash, developed by DeepSeek-AI, is a Mixture-of-Experts (MoE) language model featuring 284 billion total parameters with 13 billion activated per token. A key differentiator is its support for a one-million-token context length, achieved through a novel hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). This design significantly improves long-context efficiency, reducing single-token inference FLOPs and KV-cache requirements compared to previous versions.
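To see why reducing KV-cache requirements matters at this scale, a back-of-the-envelope estimate helps. The layer count, head counts, head dimension, and compression ratio below are purely illustrative assumptions, not published DeepSeek-V4-Flash hyperparameters:

```python
# Back-of-the-envelope KV-cache size for a long-context transformer.
# All architecture numbers below are illustrative assumptions, NOT
# published DeepSeek-V4-Flash hyperparameters.

def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 covers both keys and values.
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dense-attention baseline at a 1M-token context (fp16).
dense = kv_cache_bytes(
    context_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128
)

# A hybrid/compressed attention scheme that shrinks the cached
# representation by, say, 8x (an assumed ratio for illustration).
compressed = dense / 8

print(f"dense KV cache:      {dense / 2**30:.1f} GiB")
print(f"compressed KV cache: {compressed / 2**30:.1f} GiB")
```

Even under these rough assumptions, an uncompressed cache at one million tokens runs into hundreds of GiB, which is why a compressed attention scheme is what makes million-token serving practical.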

Key Capabilities & Innovations

  • Million-Token Context: Handles extremely long inputs and outputs, ideal for complex document analysis or extended conversations.
  • Hybrid Attention Architecture: Optimizes efficiency for long contexts, making it practical for high-throughput applications.
  • Manifold-Constrained Hyper-Connections (mHC): Enhances signal propagation stability across model layers.
  • Muon Optimizer: Contributes to faster convergence and greater training stability during pre-training on over 32 trillion tokens.
  • Reasoning Effort Modes: Offers 'Non-think', 'Think High', and 'Think Max' modes, allowing users to balance speed and reasoning depth. The 'Think Max' mode, while requiring a larger thinking budget, enables the model to achieve reasoning performance comparable to the larger Pro version.
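Reasoning effort is typically chosen per request. The sketch below builds an OpenAI-compatible chat payload with a `reasoning_effort` field; the field name and its allowed values are assumptions derived from the modes listed above, not a documented DeepSeek API:

```python
# Sketch of selecting a reasoning mode per request. The payload shape
# follows the common OpenAI-compatible chat-completions convention; the
# `reasoning_effort` field name and its values are assumptions drawn
# from the modes listed above ('Non-think', 'Think High', 'Think Max').

VALID_MODES = {"non-think", "think-high", "think-max"}

def build_chat_request(prompt: str, mode: str = "non-think") -> dict:
    """Build a chat-completions payload with a reasoning-effort hint."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reasoning mode: {mode!r}")
    return {
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt}],
        # Hypothetical knob: deeper modes spend a larger thinking budget.
        "reasoning_effort": mode,
    }

payload = build_chat_request("Prove that sqrt(2) is irrational.", mode="think-max")
print(payload["reasoning_effort"])
```

The trade-off mirrors the description above: 'think-max' buys Pro-comparable reasoning at the cost of a larger thinking budget, so latency-sensitive calls would stay on 'non-think'.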

When to Use DeepSeek-V4-Flash

  • Long-Context Applications: Ideal for tasks requiring understanding or generation over very long documents, codebases, or conversations.
  • Efficient Inference: Its optimized architecture makes it suitable for scenarios where long-context processing needs to be efficient.
  • Complex Reasoning Tasks: When paired with the 'Think Max' mode, it can tackle challenging reasoning and agentic tasks, bridging the gap with larger models.
  • Resource-Constrained Environments: As the smaller model in the V4 series, its 13B activated parameters offer a balance of performance and computational efficiency.
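For the long-context use cases above, a quick budget check before sending a request can save a failed call. The sketch below uses the rough ~4-characters-per-token heuristic for English text; real budgeting should use the model's actual tokenizer:

```python
# Rough check of whether a set of documents fits in a 1M-token window.
# Uses the coarse ~4-characters-per-token heuristic for English text;
# production code should count tokens with the model's real tokenizer.

CONTEXT_LIMIT = 1_000_000
CHARS_PER_TOKEN = 4  # heuristic, not a tokenizer

def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(docs: list[str], reserve_for_output: int = 8_192) -> bool:
    """True if the concatenated docs plus an output reserve fit the window."""
    used = sum(estimate_tokens(d) for d in docs)
    return used + reserve_for_output <= CONTEXT_LIMIT

docs = ["x" * 400_000, "y" * 1_200_000]  # ~100k + ~300k estimated tokens
print(fits_in_context(docs))
```

Reserving headroom for the model's output (here an assumed 8,192 tokens) matters because the context window is shared between input and generation.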