deepseek-ai/DeepSeek-V4-Pro

Text Generation · Concurrency Cost: 4 · Model Size: 862B · Quant: FP8 · Ctx Length: 32k · Published: Apr 22, 2026 · License: MIT · Architecture: Transformer · 3.1K · Open Weights · Warm

DeepSeek-V4-Pro is a 1.6 trillion parameter (49 billion activated) Mixture-of-Experts (MoE) language model developed by DeepSeek-AI, supporting a one-million-token context length. It features a hybrid attention architecture and Manifold-Constrained Hyper-Connections (mHC) for improved long-context efficiency and signal-propagation stability. Pre-trained on over 32 trillion tokens, the model excels at complex reasoning, coding benchmarks, and agentic tasks, aiming to close the gap with leading closed-source models.


DeepSeek-V4-Pro: Million-Token Context MoE Model

DeepSeek-V4-Pro, developed by DeepSeek-AI, is a 1.6 trillion parameter (49 billion activated) Mixture-of-Experts (MoE) language model designed for efficient long-context inference. Its standout feature is a one-million-token context window, achieved through a novel hybrid attention mechanism that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA). This architecture dramatically reduces inference FLOPs and KV-cache requirements compared to previous DeepSeek models.
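To see why this matters at a million tokens, the back-of-the-envelope sketch below estimates a plain dense-attention KV cache and the effect of compression. The layer count, head configuration, and compression ratio are illustrative assumptions, not published DeepSeek-V4-Pro hyperparameters.

```python
# Rough KV-cache sizing at a 1M-token context.
# All hyperparameters are illustrative assumptions, NOT the
# published DeepSeek-V4-Pro configuration.

CTX_LEN = 1_000_000   # target context length (tokens)
N_LAYERS = 64         # assumed transformer depth
N_KV_HEADS = 8        # assumed KV heads (grouped-query style)
HEAD_DIM = 128        # assumed per-head dimension
BYTES_PER_ELEM = 1    # FP8: one byte per element

# Dense attention stores one key and one value vector per token, per layer.
dense_kv = CTX_LEN * N_LAYERS * N_KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEM
print(f"dense KV cache @ 1M tokens: {dense_kv / 2**30:.1f} GiB")  # ~122 GiB

# A hypothetical 8x reduction from a compressed hybrid-attention scheme.
COMPRESSION = 8
print(f"compressed (assumed {COMPRESSION}x): {dense_kv / COMPRESSION / 2**30:.1f} GiB")
```

Even under these modest assumptions, the uncompressed cache alone would exceed a single accelerator's memory, which is why KV-cache compression is central to serving 1M-token requests.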

Key Capabilities & Innovations

  • Extended Context Efficiency: Optimized for 1M-token contexts, with substantially lower inference FLOPs and KV-cache usage than prior DeepSeek models.
  • Enhanced Stability: Incorporates Manifold-Constrained Hyper-Connections (mHC) for robust signal propagation.
  • Advanced Training: Pre-trained on over 32 trillion diverse tokens, utilizing a two-stage post-training pipeline with domain-specific experts and on-policy distillation.
  • Reasoning Modes: Offers 'Non-think', 'Think High', and 'Think Max' modes, letting users trade latency for depth of reasoning; 'Think Max' pushes the model's reasoning to its fullest extent (see the sketch after this list).
  • Top-tier Performance: DeepSeek-V4-Pro-Max demonstrates strong performance across coding, reasoning, and agentic benchmarks, often rivaling or surpassing other frontier models.
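
If DeepSeek-V4-Pro is served behind an OpenAI-compatible endpoint, selecting a reasoning mode might look like the sketch below. The endpoint URL and the `reasoning_mode` field are hypothetical placeholders, not a documented API; consult the actual serving documentation for the real parameter name and values.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; the URL, key, and the
# `reasoning_mode` extension field are assumptions, not a documented API.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # Server-specific extension field carrying the assumed mode name.
    extra_body={"reasoning_mode": "think-high"},  # or "non-think" / "think-max"
)
print(response.choices[0].message.content)
```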

Ideal Use Cases

  • Complex Problem Solving: Excels in scenarios requiring deep logical analysis and multi-step reasoning.
  • Long-Context Applications: Suited for tasks involving extensive documents, codebases, or conversational histories up to 1 million tokens (see the loading sketch after this list).
  • Code Generation & Agentic Workflows: Achieves high scores in coding benchmarks and agentic tasks, making it valuable for development and automation.
  • Knowledge-Intensive Tasks: Bridges the gap with leading closed-source models on various knowledge and reasoning benchmarks.
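
As a concrete starting point for the long-context use case above, here is a minimal Hugging Face `transformers` sketch. It assumes the checkpoint ships custom modeling code (hence `trust_remote_code=True`) and that enough GPU memory is available; the input file name is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-V4-Pro"

# trust_remote_code is an assumption: MoE checkpoints often ship
# custom modeling code alongside the weights.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # take the dtype recorded in the checkpoint
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)

# Feed a long document plus a question; a 1M-token window means whole
# codebases or books can fit into a single prompt.
with open("big_document.txt") as f:  # placeholder input file
    document = f.read()

prompt = f"{document}\n\nSummarize the key arguments of the document above."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```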