★ 8/10 · AI · 2026-04-24

DeepSeek-V4: a million-token context that agents can actually use

Summary

DeepSeek-V4 introduces a 1-million-token context window optimized specifically for long-running agentic workloads. The architecture utilizes a hybrid attention mechanism to significantly reduce KV cache memory and inference FLOPs, enabling more efficient multi-step tool-use and reasoning.

Key Points

  • DeepSeek-V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache memory of DeepSeek-V3.2.
  • The architecture employs a hybrid attention mechanism alternating between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) across its 61-layer stack.
  • The KV cache size is reduced to approximately 2% of a standard GQA-based bfloat16 architecture.
  • A new XML-based tool-call format using the |DSML| token separates string parameters (string="true") from structured JSON parameters (string="false") to prevent parsing errors.
  • The model preserves reasoning traces across user message boundaries specifically when the conversation contains tool calls.
  • Benchmark performance includes a 67.9 score on Terminal Bench 2.0, 80.6 resolved tasks on SWE Verified, and 73.6 on MCPAtlas Public.
  • Available model variants include DeepSeek-V4-Pro (1.6T total / 49B activated) and DeepSeek-V4-Flash (284B total / 13B activated).
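To make the string/JSON distinction in the tool-call format concrete, here is a minimal parsing sketch. The article only specifies the |DSML| token and the string="true" / string="false" attribute; the tag names, attribute layout, and surrounding delimiters below are illustrative assumptions, not the official schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical DSML-style tool call (the enclosing |DSML| delimiter token is
# omitted here). Parameters marked string="true" are taken verbatim; those
# marked string="false" are parsed as JSON, which is what prevents a path or
# free-text value from being mangled by a JSON parser.
call = """
<tool_call name="write_file">
  <param name="path" string="true">/tmp/notes.txt</param>
  <param name="options" string="false">{"append": true, "mode": 420}</param>
</tool_call>
"""

def parse_call(xml_text):
    root = ET.fromstring(xml_text)
    args = {}
    for p in root.findall("param"):
        raw = p.text or ""
        if p.get("string") == "true":
            args[p.get("name")] = raw              # verbatim string value
        else:
            args[p.get("name")] = json.loads(raw)  # structured JSON value
    return root.get("name"), args

name, args = parse_call(call)
```

Keeping strings out of the JSON path is the point of the split: a literal value like a file path never needs escaping, while structured arguments still get full JSON typing.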

Technical Details

The 1M-token context window's efficiency comes from a split-attention architecture. Compressed Sparse Attention (CSA) compresses KV entries 4x via softmax-gated pooling with a learned positional bias, and uses an FP4-based lightning indexer to select the top-k compressed blocks. Heavily Compressed Attention (HCA) compresses 128x, making dense attention over the resulting stream affordable. In the 61-layer V4-Pro stack, layers 0–1 use HCA, layers 2–60 alternate between CSA and HCA, and the final MTP block uses a sliding-window mechanism. Most KV entries are stored in FP8; only the RoPE dimensions use BF16.
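A back-of-envelope calculation shows how the compression ratios combine. All dimensions below (KV head count, head dim, RoPE slice) are made-up placeholders, not published V4 specs, so the resulting percentage is indicative only; the reported ~2% figure presumably reflects the model's actual dimensions.

```python
# Per-token, per-layer KV cache: GQA bf16 baseline vs. the hybrid scheme.
# All dimensions here are illustrative assumptions, not published V4 specs.
BYTES_BF16, BYTES_FP8 = 2, 1

# Baseline: GQA in bfloat16 with 8 KV heads of dim 128 (K and V per token).
kv_heads, head_dim = 8, 128
baseline = 2 * kv_heads * head_dim * BYTES_BF16

# Hybrid: FP8 KV entries plus a small BF16 RoPE slice, then divided by the
# compression factor (4x for CSA layers, 128x for HCA layers).
rope_dim = 64
per_token = 2 * kv_heads * head_dim * BYTES_FP8 + rope_dim * BYTES_BF16
csa = per_token / 4
hca = per_token / 128

# Layers 2-60 alternate CSA/HCA, so roughly half the stack uses each.
hybrid = (csa + hca) / 2
ratio = hybrid / baseline
print(f"hybrid cache is roughly {ratio:.1%} of the GQA bf16 baseline")
```

Note how the HCA layers are nearly free: at 128x compression they contribute a few percent of what the CSA layers do, so the average is dominated by the 4x CSA term.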

For agentic workflows, V4 implements specialized post-training to maintain a coherent chain of thought. Unlike previous versions that discarded reasoning traces upon new user messages, V4 retains reasoning content across tool-result rounds and user message boundaries if tool calls are present. The tool-calling infrastructure is supported by DSec, a Rust-based platform that manages execution across function calls, containers, Firecracker microVMs, and QEMU VMs. The model supports three reasoning modes: Non-think, Think High, and Think Max (the latter requiring a minimum context window of 384K tokens).
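The retention rule described above can be sketched as a small predicate over the conversation. The message dict shape and field names here are illustrative assumptions, not DeepSeek's actual chat-template schema.

```python
# Sketch of the reasoning-retention rule: reasoning content survives a new
# user message only when the conversation contains tool calls. Message shape
# ('role', 'reasoning', 'tool_calls' keys) is an illustrative assumption.

def prune_reasoning(messages):
    """Return messages with 'reasoning' dropped where the rule discards it."""
    if any(m.get("tool_calls") for m in messages):
        return messages  # tool calls present: keep all reasoning traces
    # No tool calls: discard reasoning from turns before the latest user turn.
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    return [
        {k: v for k, v in m.items() if k != "reasoning"} if i < last_user else m
        for i, m in enumerate(messages)
    ]

chat = [
    {"role": "user", "content": "Refactor this module."},
    {"role": "assistant", "content": "Done.", "reasoning": "plan the edit"},
    {"role": "user", "content": "Now add tests."},
]
pruned = prune_reasoning(chat)  # no tool calls, so reasoning is discarded
```

Under V3-style behavior, every new user message would trigger the pruning branch; V4's change is the early return, which keeps the agent's chain of thought intact across multi-turn tool-use sessions.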

Impact / Why It Matters

Developers can deploy long-context agents with significantly lower GPU memory overhead and improved reliability during multi-turn tool-use tasks. However, integrating these models into existing workflows will require updating tool-calling harnesses to support the new |DSML| XML-based schema.

AI LLM Software Engineering
