deepseek-ai/DeepSeek-V4-Pro

Text Generation
Concurrency Cost: 4
Model Size: 862B
Quant: FP8
Ctx Length: 256k
Tool Calling: Supported
Published: Apr 22, 2026
License: mit
Architecture: Transformer
5.1K, Open Weights, Warm

DeepSeek-AI's DeepSeek-V4-Pro is a Mixture-of-Experts (MoE) language model with 1.6 trillion total parameters, 49 billion of which are activated, designed for efficient processing of million-token contexts. It combines a hybrid attention architecture with Manifold-Constrained Hyper-Connections (mHC) to improve long-context processing and the stability of signal propagation. Pre-trained on over 32 trillion tokens, DeepSeek-V4-Pro excels at complex reasoning, agentic tasks, and coding benchmarks, along with knowledge-intensive tasks.

DeepSeek-V4-Pro: Million-Token Context Intelligence

DeepSeek-V4-Pro, developed by DeepSeek-AI, is a Mixture-of-Experts (MoE) language model with 1.6 trillion total parameters, of which 49 billion are activated. It is engineered to process contexts of up to one million tokens efficiently.
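
To put the two parameter counts in proportion, the short calculation below works out what share of the weights is activated. It is only a sketch built from the figures quoted above; the constant and function names are illustrative.

```python
# Figures quoted in the description above.
TOTAL_PARAMS = 1.6e12   # 1.6 trillion total parameters
ACTIVE_PARAMS = 49e9    # 49 billion activated parameters


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters that are activated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share: {fraction:.2%}")  # prints "Active share: 3.06%"
```

Roughly 3% of the weights, or about one parameter in 33, are activated, which is where the compute savings of the MoE design come from.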

Key Architectural Innovations

  • Hybrid Attention Architecture: Combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency. In the 1M-token setting, single-token inference FLOPs drop by 73% and KV cache usage by 90% relative to DeepSeek-V3.2.
  • Manifold-Constrained Hyper-Connections (mHC): Enhances residual connections for stable signal propagation across layers while maintaining model expressivity.
  • Muon Optimizer: Used during training for faster convergence and improved stability.
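
The percentages quoted for the hybrid attention architecture are easier to compare as multipliers. The sketch below converts each reduction into the fraction of the baseline cost that remains. The DeepSeek-V3.2 baseline is normalized to 1.0 because no absolute figures are given here, and the helper name is illustrative.

```python
def remaining_cost(reduction_pct: float) -> float:
    """Convert a percentage reduction into the fraction of the baseline that remains."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return 1.0 - reduction_pct / 100.0


# Reductions quoted above for the 1M-token setting, relative to DeepSeek-V3.2.
flops_ratio = remaining_cost(73)  # single-token inference FLOPs
kv_ratio = remaining_cost(90)     # KV cache usage

print(f"FLOPs per token: {flops_ratio:.2f}x of baseline")
print(f"KV cache: {kv_ratio:.2f}x of baseline")
```

Read this way, the quoted figures correspond to about 27% of the per-token FLOPs and 10% of the KV cache of the baseline.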

Performance and Capabilities

DeepSeek-V4-Pro is pre-trained on over 32 trillion diverse, high-quality tokens. Post-training follows a two-stage pipeline: domain-specific expert cultivation, then unified model consolidation. The model supports three reasoning effort modes: 'Non-think' for fast responses, 'Think High' for conscious logical analysis, and 'Think Max' for pushing reasoning to its fullest extent. In its maximum reasoning effort mode, referred to as DeepSeek-V4-Pro-Max, the model demonstrates top-tier performance on coding benchmarks and significantly narrows the gap with leading closed-source models on reasoning and agentic tasks.

Use Cases

  • Complex Reasoning: Excels in tasks requiring deep logical analysis and problem-solving.
  • Agentic Workflows: Strong performance in tasks involving planning and multi-step execution.
  • Coding: Achieves high scores on coding benchmarks such as LiveCodeBench and Codeforces.
  • Long-Context Applications: Ideal for tasks requiring understanding and generation over extremely long documents or conversations, thanks to its 1M token context window.
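
For long-context use, it can help to check a prompt against the window before sending it. The following is a minimal sketch and not part of any DeepSeek or Featherless API: the function name, the output reserve, and the example token count are hypothetical. The two limits come from this page: 1M tokens in the model description and 256k in the listing metadata, read here as 256,000.

```python
MODEL_CONTEXT_TOKENS = 1_000_000  # window stated in the model description
LISTED_CONTEXT_TOKENS = 256_000   # 256k context length from the listing; exact value assumed


def fits_in_context(prompt_tokens: int, context_limit: int,
                    reserve_for_output: int = 8_000) -> bool:
    """Return True if the prompt plus a reserved output budget fits in the window."""
    if prompt_tokens < 0 or reserve_for_output < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + reserve_for_output <= context_limit


# Hypothetical prompt of 400,000 tokens.
prompt_tokens = 400_000
fits_model = fits_in_context(prompt_tokens, MODEL_CONTEXT_TOKENS)
fits_listing = fits_in_context(prompt_tokens, LISTED_CONTEXT_TOKENS)
print(fits_model)    # True
print(fits_listing)  # False
```

The same prompt passes against the 1M-token window but not against the listed 256k limit, so the limit that applies to a given deployment is worth confirming before relying on the full window.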
