deepseek-ai/DeepSeek-V4-Flash
DeepSeek-V4-Flash is a Mixture-of-Experts (MoE) language model developed by DeepSeek-AI, with 284 billion total parameters (13 billion activated) and support for a one-million-token context length. It features a hybrid attention architecture and Manifold-Constrained Hyper-Connections for efficient long-context processing. Pre-trained on over 32 trillion tokens, DeepSeek-V4-Flash is designed for efficiency across diverse tasks and offers strong reasoning capabilities, especially when given a larger thinking budget.
DeepSeek-V4-Flash: Efficient Million-Token Context MoE
DeepSeek-V4-Flash, developed by DeepSeek-AI, is a Mixture-of-Experts (MoE) language model with 284 billion total parameters, of which 13 billion are activated. Its key differentiator is support for a one-million-token context length, which enables understanding and generation over very long inputs.
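To put the parameter counts in perspective, the short sketch below computes what share of the model's weights is activated, using only the two figures quoted above. The function name `activated_fraction` is an illustrative helper, not part of any DeepSeek tooling.

```python
def activated_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the share of an MoE model's parameters that is activated."""
    if total_params_b <= 0:
        raise ValueError("total_params_b must be positive")
    return active_params_b / total_params_b


# Figures from the model description: 284B total, 13B activated.
fraction = activated_fraction(284, 13)
print(f"Activated share: {fraction:.1%}")  # prints: Activated share: 4.6%
```

Under these figures, roughly one parameter in twenty-two takes part in a given forward computation, which is where the efficiency of the MoE design comes from.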
Key Architectural Innovations
This model incorporates several advancements for efficiency and performance:
- Hybrid Attention Architecture: Combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to significantly improve long-context efficiency, reducing single-token inference FLOPs and KV cache usage compared to previous versions.
- Manifold-Constrained Hyper-Connections (mHC): Enhances signal propagation stability across layers, contributing to overall model robustness.
- Muon Optimizer: Trained with the Muon optimizer for faster convergence and improved training stability.
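To illustrate why KV cache usage matters at a one-million-token context, here is a rough back-of-the-envelope estimate for a standard attention layout. Every architectural number below (layer count, KV heads, head dimension, and the `compression` factor) is a hypothetical placeholder, not a published DeepSeek-V4-Flash value, and the single `compression` divisor is only a stand-in for whatever CSA and HCA actually do.

```python
def kv_cache_bytes(seq_len, num_layers, kv_heads, head_dim,
                   bytes_per_value=2, compression=1.0):
    """Estimate the KV cache size in bytes for one sequence.

    Standard attention stores one key vector and one value vector
    per token, per layer, per KV head. `compression` is a hypothetical
    factor by which a compressed attention scheme shrinks that cache.
    """
    if compression < 1.0:
        raise ValueError("compression must be >= 1.0")
    per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_value
    return seq_len * per_token / compression


GIB = 1024 ** 3
CONTEXT = 1_000_000  # context length from the model description

# Hypothetical configuration, chosen only to make the arithmetic concrete.
baseline = kv_cache_bytes(CONTEXT, num_layers=48, kv_heads=8, head_dim=128)
compressed = kv_cache_bytes(CONTEXT, num_layers=48, kv_heads=8, head_dim=128,
                            compression=16.0)

print(f"Uncompressed cache: {baseline / GIB:.1f} GiB")
print(f"With 16x compression (assumed): {compressed / GIB:.1f} GiB")
```

The point of the sketch is the scaling: the cache grows linearly with sequence length, so any per-token reduction pays off a million times over at full context.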
Performance and Reasoning Modes
DeepSeek-V4-Flash was pre-trained on over 32 trillion diverse tokens and features a two-stage post-training pipeline. It offers three distinct reasoning effort modes:
- Non-think: For fast, intuitive responses to routine tasks.
- Think High: Applies deliberate, step-by-step logical analysis to complex problems.
- Think Max: Pushes reasoning to its fullest extent, achieving reasoning performance comparable to the larger DeepSeek-V4-Pro when given a sufficient thinking budget.
While its smaller parameter scale places it slightly behind DeepSeek-V4-Pro on pure knowledge tasks, DeepSeek-V4-Flash demonstrates strong performance across various benchmarks, particularly in long-context understanding and reasoning when leveraging its 'Think Max' mode.
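As a rough illustration of how an application might choose among the three modes, the sketch below maps a task description to a mode. The enum values, the `pick_mode` helper, and the 32,000-token threshold are all hypothetical; the source does not specify how modes are selected or named in any API.

```python
from enum import Enum


class ReasoningMode(Enum):
    # String values are illustrative labels, not confirmed API identifiers.
    NON_THINK = "non-think"
    THINK_HIGH = "think-high"
    THINK_MAX = "think-max"


def pick_mode(task_is_routine: bool, thinking_budget_tokens: int,
              max_threshold: int = 32_000) -> ReasoningMode:
    """Pick a reasoning mode from task difficulty and available budget.

    `max_threshold` is an assumed cutoff above which the budget is
    considered large enough to justify Think Max.
    """
    if task_is_routine:
        return ReasoningMode.NON_THINK
    if thinking_budget_tokens >= max_threshold:
        return ReasoningMode.THINK_MAX
    return ReasoningMode.THINK_HIGH


print(pick_mode(task_is_routine=True, thinking_budget_tokens=0).value)
print(pick_mode(task_is_routine=False, thinking_budget_tokens=8_000).value)
print(pick_mode(task_is_routine=False, thinking_budget_tokens=64_000).value)
```

The design mirrors the description above: routine tasks skip thinking entirely, and Think Max is reserved for hard problems where a large budget is available.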