[AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips
Summary
DeepSeek has released the V4 model family, consisting of DeepSeek V4 Pro and DeepSeek V4 Flash, marking a significant architectural update to the series. The release introduces a 1M token context window and advanced attention mechanisms designed to drastically reduce memory and computational overhead.
Key Points
- Model Architectures: DeepSeek V4 Pro features 1.6T total parameters (49B active), while DeepSeek V4 Flash features 284B total parameters (13B active); a rough per-token compute comparison is sketched just after this list.
- Context Window: The context capacity has expanded to 1M tokens, supported by Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).
- Training Scale: The models were trained on approximately 32T–33T tokens using mixed FP4 and FP8 precision.
- Efficiency Gains: At a 1M-token context length, the architecture requires only 27% of the FLOPs and 10% of the KV-cache memory of DeepSeek V3.2.
- Performance Benchmarks: V4 Pro achieved a score of 52 on the Artificial Analysis Intelligence Index, positioning it as the #2 open-weights reasoning model.
- Licensing and Hardware: The models are released under an MIT license and feature compatibility with Huawei Ascend chips.
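Because both models are MoE, per-token compute tracks the active parameter count rather than the headline total. The following is a minimal back-of-envelope sketch, assuming the common ~2 FLOPs per active parameter per token approximation and ignoring attention cost (which grows with context length):

```python
# Back-of-envelope: per-token forward-pass compute in an MoE model scales
# with the *active* parameter count, not the total. Uses the rough
# ~2 FLOPs per active parameter per token approximation.

MODELS = {
    "DeepSeek V4 Pro":   {"total": 1.6e12, "active": 49e9},
    "DeepSeek V4 Flash": {"total": 284e9,  "active": 13e9},
}

for name, p in MODELS.items():
    flops_per_token = 2 * p["active"]            # ~2 FLOPs per active param
    active_share = p["active"] / p["total"]
    print(f"{name}: ~{flops_per_token / 1e9:.0f} GFLOPs/token, "
          f"{active_share:.1%} of weights active per token")
```

Under this approximation, Pro lands around 98 GFLOPs per token despite its 1.6T total weights, which is why the active-parameter figure is the more useful number for serving-cost estimates.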
Technical Details
The V4 architecture implements a hybrid attention system designed for ultra-long-context efficiency. The system combines shared KV vectors, compressed KV streams, and sparse attention over compressed tokens with a 128-token sliding window. Together, the CSA and HCA layers reduce the KV cache at 1M-token context to 9.62 GiB per sequence (in bf16), an 8.7x reduction relative to DeepSeek V3.2.
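The exact CSA/HCA formulation is not spelled out in the release, but the described pattern of a local sliding window plus sparse attention over compressed tokens can be illustrated with a toy attention mask. The 64-token block size and the mask layout below are assumptions for illustration only:

```python
import numpy as np

# Toy illustration of the general pattern described above: each query mixes a
# 128-token local sliding window over raw tokens with sparse attention over
# compressed block summaries of the distant past. Block size and layout are
# assumptions, not disclosed details of CSA/HCA.

SEQ_LEN = 1024      # toy length; the real models target 1M tokens
WINDOW = 128        # sliding-window width from the release notes
BLOCK = 64          # assumed raw tokens per compressed summary token

n_blocks = SEQ_LEN // BLOCK
# Key/value columns: [compressed summaries | raw tokens]; rows are raw queries.
mask = np.zeros((SEQ_LEN, n_blocks + SEQ_LEN), dtype=bool)

for q in range(SEQ_LEN):
    lo = max(0, q - WINDOW + 1)
    mask[q, n_blocks + lo : n_blocks + q + 1] = True   # local window (causal)
    mask[q, : q // BLOCK] = True                       # summaries of past blocks

avg_keys = mask.sum(axis=1).mean()
print(f"avg keys per query: {avg_keys:.0f} of {n_blocks + SEQ_LEN} "
      f"(a dense causal mask would average ~{SEQ_LEN // 2})")
```

The point of the sketch is the scaling behavior: each query touches a roughly constant number of raw tokens plus a slowly growing set of compressed summaries, which is what lets both attention FLOPs and KV-cache size grow far more slowly than dense attention at 1M-token context.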
The model checkpoints use a mixed-precision format: MoE expert weights are quantized to FP4, while the attention, normalization, and router layers are kept in FP8. This precision strategy enables the full V4 Pro model to fit within a single 8×B200 node. The release includes both Base and Instruct versions, supporting both direct deployment and further fine-tuning for specialized tasks.
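A quick sanity check on the single-node claim: FP4 stores weights at 4 bits per parameter and FP8 at 8 bits. The expert/non-expert parameter split and the per-GPU HBM figure below are assumptions for illustration; the release does not give an exact breakdown.

```python
# Rough check that FP4 experts + FP8 everything-else keeps a 1.6T-parameter
# checkpoint within one 8xB200 node. The 97% expert share and the 180 GB
# per-GPU HBM figure are conservative assumptions for illustration.

TOTAL_PARAMS = 1.6e12
EXPERT_FRACTION = 0.97          # assumed share of parameters in MoE experts
HBM_GB = 8 * 180                # assumed usable HBM across the node

expert_gb = TOTAL_PARAMS * EXPERT_FRACTION * 0.5 / 1e9        # FP4 = 0.5 B/param
other_gb = TOTAL_PARAMS * (1 - EXPERT_FRACTION) * 1.0 / 1e9   # FP8 = 1 B/param

print(f"weights ≈ {expert_gb + other_gb:.0f} GB of {HBM_GB} GB HBM; "
      f"the remainder is headroom for KV cache and activations")
```

Under these assumptions the weights come to roughly 820 GB, leaving several hundred gigabytes of node memory for the (already compressed) KV cache and activations.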
Impact / Why It Matters
The sharp reduction in KV-cache memory and FLOP requirements allows high-intelligence, long-context models to be deployed on much smaller hardware footprints. For developers, the combination of an MIT license and competitive token pricing ($0.14–$3.48 per 1M tokens) makes complex agentic workflows and large-scale document-processing pipelines viable at lower operational cost.