From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates
Summary
DeepSeek has evolved its model line from the base DeepSeek V3 and the dedicated reasoning model DeepSeek R1 toward a hybrid architecture in the V3.1 and V3.2 series. The lineage relies on Multi-Head Latent Attention (MLA) for KV-cache memory efficiency and, in the most recent experimental release, introduces a non-standard sparse attention variant.
Key Points
- DeepSeek V3 Architecture: Utilizes a Mixture-of-Experts (MoE) framework combined with Multi-Head Latent Attention (MLA) to optimize KV cache memory usage.
- DeepSeek R1 Training: Employs Reinforcement Learning with Verifiable Rewards (RLVR) and the Group Relative Policy Optimization (GRPO) algorithm to enhance reasoning in domains like mathematics and coding.
- Hybrid Reasoning (V3.1): Transitioned from a dedicated reasoning model to hybrid models that let users toggle between instruct and reasoning modes via the prompt template (see the sketch after this list).
- Sparse Attention (V3.2-Exp): Introduced a non-standard sparse attention mechanism that requires custom inference code and updated deployment infrastructure.
- Performance Benchmarks: DeepSeek V3.2 is reported to achieve performance levels comparable to GPT-5 and Gemini 3.0 Pro.
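The mode toggle in the hybrid releases is driven by the chat template rather than by separate checkpoints. A minimal sketch using Hugging Face `transformers` follows; the repository name and the `thinking` keyword are assumptions about the published template and may differ between releases.

```python
from transformers import AutoTokenizer

# Assumption: the published DeepSeek chat template exposes a boolean flag
# (shown here as `thinking`) that switches between instruct and reasoning
# modes; the exact repo name and keyword may differ by release.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")

messages = [{"role": "user", "content": "Factor 391 into primes."}]

# Instruct mode: the template produces a plain assistant turn.
instruct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=False
)

# Reasoning mode: the template inserts the tokens that trigger a thinking trace.
reasoning_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=True
)
```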
Technical Details
The DeepSeek architecture relies heavily on Multi-Head Latent Attention (MLA) to reduce the memory footprint of the KV cache. MLA compresses keys and values into a lower-dimensional latent space before caching; during inference, the cached latents are up-projected back to full dimensionality on the fly. While this adds extra matrix multiplications per decoding step, the much smaller cache significantly reduces memory requirements and improves inference efficiency.
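A minimal sketch of the compress-then-up-project idea is shown below. Dimensions and module names are illustrative, not DeepSeek's actual configuration, and details such as rotary position embeddings and query compression are omitted.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Sketch of the MLA caching idea: store a small latent instead of full K/V."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down_proj = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_proj_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_proj_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values

    def forward(self, hidden_states, latent_cache=None):
        # Compress the new tokens' hidden states and append them to the cache.
        latent = self.down_proj(hidden_states)                 # [batch, seq, d_latent]
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        # Up-project the cached latents back to full-size keys/values on the fly.
        k = self.up_proj_k(latent)                             # [batch, total_seq, heads * d_head]
        v = self.up_proj_v(latent)
        return k, v, latent                                    # only `latent` needs to be stored

# With these illustrative sizes, caching the latent instead of separate K and V
# tensors shrinks the per-token cache from 2 * 32 * 128 = 8192 values to 512,
# at the cost of the two extra up-projection matmuls per decoding step.
```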
For reasoning capabilities, DeepSeek R1 uses RLVR, a method in which the model learns from responses that can be verified programmatically or symbolically (e.g., by executing generated code or checking math answers). Training is implemented with the GRPO algorithm, a simplified variant of Proximal Policy Optimization (PPO) that drops the learned value (critic) model and instead normalizes each response's reward against the other responses sampled for the same prompt.
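Since the source does not spell out the update rule, the sketch below shows only the group-relative advantage computation that replaces PPO's learned critic; the clipped policy-gradient loss and KL penalty that complete the algorithm are omitted, and the shapes are illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled response is scored against the
    mean and standard deviation of its own group, so no value network is needed.

    rewards: [num_prompts, group_size] verifiable rewards, e.g. 1.0 if the
    generated code passes its tests or the final math answer matches, else 0.0.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: one prompt, four sampled answers, two of which verify as correct.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # correct answers get positive advantage, wrong ones negative
```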
The V3.1 and V3.2 releases mark the shift toward hybrid models. Unlike R1, which was a dedicated reasoning model, they process general chat and reasoning tasks with a single set of weights. However, the sparse attention variant introduced in the V3.2-Exp release is a breaking change for standard transformer implementations, necessitating custom kernels or modified inference engines to handle the non-standard attention mechanism.
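DeepSeek's V3.2 attention is not a drop-in replacement for standard attention APIs, so the sketch below only illustrates the general shape of query-dependent top-k sparsity under assumed shapes; the actual scoring function, selection granularity, and fused kernels in the release differ.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, scores, k_top=64):
    """Illustrative top-k sparse attention: each query attends only to the
    k_top past tokens ranked highest by a cheap precomputed relevance score.

    q: [batch, q_len, d], k/v: [batch, kv_len, d], scores: [batch, q_len, kv_len]
    """
    k_top = min(k_top, k.shape[1])
    # Keep only the highest-scoring keys per query; mask out everything else.
    topk_idx = scores.topk(k_top, dim=-1).indices          # [batch, q_len, k_top]
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk_idx, 0.0)
    attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5) + mask
    return F.softmax(attn, dim=-1) @ v

# This dense-masking fallback still does O(q_len * kv_len) work; the point of the
# custom kernels the release requires is to skip masked positions entirely and
# never materialize the full attention matrix.
```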
Impact / Why It Matters
Developers and DevOps engineers must implement custom inference infrastructure and updated kernels to support the sparse attention mechanism used in DeepSeek V3.2. The shift to hybrid models simplifies deployment pipelines by allowing a single model weight set to handle both standard instruction following and complex reasoning tasks.