DeepSeek-V4: a million-token context that agents can actually use
Summary
DeepSeek-V4 introduces a 1-million-token context window optimized for long-running agentic workloads. A hybrid attention mechanism sharply reduces KV-cache memory and inference FLOPs, making multi-step tool use and reasoning more efficient.
Key Points
- DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV-cache memory of DeepSeek-V3.2.
- The architecture employs a hybrid attention mechanism alternating between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) across its 61-layer stack.
- The KV cache shrinks to approximately 2% of the size of a standard GQA-based bfloat16 architecture's.
- A new XML-based tool-call format using the |DSML| token separates string parameters (string="true") from structured JSON parameters (string="false") to prevent parsing errors.
- The model preserves reasoning traces across user message boundaries specifically when the conversation contains tool calls.
- Benchmark results include 67.9 on Terminal Bench 2.0, 80.6 resolved on SWE Verified, and 73.6 on MCPAtlas Public.
- Available model variants include DeepSeek-V4-Pro (1.6T total / 49B activated) and DeepSeek-V4-Flash (284B total / 13B activated).
Technical Details
The efficiency of the 1M context window comes from a split-attention architecture. Compressed Sparse Attention (CSA) compresses KV entries 4x via softmax-gated pooling with a learned positional bias, and uses an FP4-based lightning indexer to select the top-k compressed blocks. Heavily Compressed Attention (HCA) compresses 128x, allowing dense attention over a highly compressed stream. In the 61-layer V4-Pro stack, layers 0–1 use HCA, layers 2–60 alternate between CSA and HCA, and the final MTP block uses a sliding-window mechanism. Most KV entries are stored in FP8; only the RoPE dimensions use BF16.
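A back-of-the-envelope calculation shows why these compression ratios matter at 1M tokens. The head dimension, KV-head count, and the parity used to split CSA/HCA layers below are illustrative assumptions, not published specs; only the 61-layer depth, the 4x/128x ratios, and the FP8-vs-BF16 dtypes come from the text, so the resulting percentage will not exactly match the quoted ~2% figure.

```python
# Rough KV-cache sizing for a 1M-token context.
# HEAD_DIM, KV_HEADS, and the CSA/HCA layer parity are assumed values.
CONTEXT = 1_000_000   # tokens
HEAD_DIM = 128        # assumed per-head dimension
KV_HEADS = 8          # assumed GQA KV heads
LAYERS = 61           # V4-Pro stack depth (from the article)

def cache_bytes(tokens, layers, bytes_per_elem, compression=1):
    """Bytes needed to store K and V for every token at every layer."""
    per_token = 2 * KV_HEADS * HEAD_DIM * bytes_per_elem  # K + V
    return tokens * layers * per_token // compression

# Baseline: dense GQA cache in bfloat16 (2 bytes per element).
baseline = cache_bytes(CONTEXT, LAYERS, bytes_per_elem=2)

# Hybrid stack: layers 0-1 HCA (128x); layers 2-60 alternate CSA (4x)
# and HCA (128x); cache stored mostly in FP8 (1 byte per element).
csa_layers = sum(1 for l in range(2, 61) if l % 2 == 0)  # assumed parity
hca_layers = LAYERS - csa_layers
hybrid = (cache_bytes(CONTEXT, csa_layers, 1, compression=4)
          + cache_bytes(CONTEXT, hca_layers, 1, compression=128))

print(f"baseline ≈ {baseline / 2**30:.1f} GiB")
print(f"hybrid   ≈ {hybrid / 2**30:.1f} GiB "
      f"({100 * hybrid / baseline:.1f}% of baseline)")
```

Even with these placeholder dimensions, the hybrid cache lands at a few percent of the dense bfloat16 baseline, which is what makes a 1M-token window feasible on a single node.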
For agentic workflows, V4 adds specialized post-training to maintain a coherent chain of thought. Unlike previous versions, which discarded reasoning traces whenever a new user message arrived, V4 retains reasoning content across tool-result rounds and user message boundaries whenever tool calls are present. Tool calling is backed by DSec, a Rust-based platform that manages execution across function calls, containers, Firecracker microVMs, and QEMU VMs. The model supports three reasoning modes: Non-think, Think High, and Think Max (the last requiring a context window of at least 384K tokens).
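The retention rule can be sketched as a history-pruning function. The message shape and field names (`role`, `reasoning`, `tool_calls`) are assumptions for illustration; only the rule itself, keep reasoning traces when the conversation contains tool calls, otherwise drop them, comes from the article.

```python
# Sketch of the trace-retention rule: reasoning survives user-message
# boundaries only when the conversation contains tool calls.
# Field names ("reasoning", "tool_calls") are hypothetical.

def prune_reasoning(messages):
    """Return a copy of the history, stripping reasoning from
    assistant turns unless any turn in the conversation made a
    tool call (in which case all reasoning is preserved)."""
    has_tool_calls = any(m.get("tool_calls") for m in messages)
    if has_tool_calls:
        return [dict(m) for m in messages]  # V4 behavior: keep traces
    pruned = []
    for m in messages:
        m = dict(m)
        m.pop("reasoning", None)  # pre-V4 behavior on plain chat
        pruned.append(m)
    return pruned
```

The point of the rule is that a multi-step tool-use episode keeps its full chain of thought intact, while ordinary chat turns still get the cheaper, trace-free history.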
Impact / Why It Matters
Developers can deploy long-context agents with significantly lower GPU memory overhead and improved reliability during multi-turn tool-use tasks. However, integrating these models into existing workflows will require updating tool-calling harnesses to support the new |DSML| XML-based schema.
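A harness update might look like the following. The element names (`tool_call`, `param`) and the payload shape are hypothetical, inferred only from the string="true"/string="false" attribute the article describes: string parameters pass through verbatim, while string="false" parameters are decoded as JSON.

```python
# Hypothetical DSML-style tool call; only the string="true"/"false"
# attribute semantics come from the article, the tag names are assumed.
import json
import xml.etree.ElementTree as ET

CALL = """
<tool_call name="search_files">
  <param name="query" string="true">error AND "timeout"</param>
  <param name="options" string="false">{"max_results": 5, "regex": false}</param>
</tool_call>
"""

def parse_call(payload):
    root = ET.fromstring(payload)
    args = {}
    for p in root.findall("param"):
        raw = p.text or ""
        # string="true": keep verbatim; string="false": decode as JSON.
        args[p.get("name")] = raw if p.get("string") == "true" else json.loads(raw)
    return root.get("name"), args

name, args = parse_call(CALL)
# args["query"] stays a raw string; args["options"] becomes a parsed dict.
```

Separating the two cases at the schema level is what prevents the classic failure mode where free-form strings containing braces or quotes get mis-parsed as JSON.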