★ 6/10 · Infra · 2026-04-16

Building the foundation for running extra-large language models

Summary

Cloudflare has implemented several architectural optimizations to the Workers AI infrastructure to support the deployment of extra-large language models, such as Kimi K2.5. These updates focus on decoupling compute-intensive and memory-intensive inference stages to improve latency and throughput for agentic workloads.

Key Points

  • Achieved a 3x improvement in inter-token latency, reducing p90 time per token from roughly 100 ms to 20-30 ms.
  • Implemented Prefill-Decode (PD) disaggregation, separating the compute-bound prefill stage from the memory-bound decode stage onto independent inference servers.
  • Integrated NVIDIA’s EAGLE-3 draft model for speculative decoding, which accelerates generation by proposing multiple candidate tokens that the target model verifies in a single forward pass (see the sketch after this list).
  • Introduced the x-session-affinity header to facilitate prompt caching, which increased input token cache hit ratios from 60% to 80% during peak periods.
  • Leveraged the Mooncake Transfer Engine and Mooncake Store to enable high-performance KV cache sharing across multiple GPUs over transports such as RDMA, NVLink, and NVMe over Fabrics.
  • Updated the proprietary Rust-based inference engine, Infire, to support multi-GPU configurations required for models with weights exceeding 560GB.
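
A minimal sketch of the draft-model speculative decoding idea referenced above: a cheap draft model proposes several candidate tokens, and the large target model verifies them in one forward pass, accepting the longest agreeing prefix. The Model interface, token handling, and greedy acceptance rule are illustrative assumptions, not the EAGLE-3 or Infire internals.

```typescript
// Sketch of greedy draft-model speculative decoding.
// `Model`, its forward() signature, and the acceptance rule are illustrative
// assumptions; they are not the EAGLE-3 or Infire implementation.

interface Model {
  // Returns the predicted next token for every position in `tokens`.
  forward(tokens: number[]): Promise<number[]>;
}

async function speculativeStep(
  draft: Model,
  target: Model,
  context: number[],
  numCandidates: number,
): Promise<number[]> {
  // 1. The small draft model proposes candidates autoregressively (cheap).
  const candidates: number[] = [];
  const draftContext = [...context];
  for (let i = 0; i < numCandidates; i++) {
    const preds = await draft.forward(draftContext);
    const next = preds[preds.length - 1];
    candidates.push(next);
    draftContext.push(next);
  }

  // 2. The large target model scores context + candidates in ONE forward pass.
  const targetPreds = await target.forward([...context, ...candidates]);

  // 3. Accept candidates left to right while the target agrees; the first
  //    disagreement is replaced by the target's own token. Each accepted
  //    candidate is a token generated without an extra target forward pass.
  const accepted: number[] = [];
  for (let i = 0; i < candidates.length; i++) {
    const targetToken = targetPreds[context.length - 1 + i];
    if (targetToken === candidates[i]) {
      accepted.push(candidates[i]);
    } else {
      accepted.push(targetToken);
      break;
    }
  }
  return accepted;
}
```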

Technical Details

The core of the optimization lies in PD disaggregation and token-aware load balancing. By running separate servers for the prefill stage (which populates the KV cache) and the decode stage (which generates tokens), Cloudflare can tune hardware configurations independently for input-heavy or output-heavy traffic. The load balancer manages this complexity by rewriting responses, including streaming SSE responses, to carry cached-token information from the prefill server, and by estimating in-flight token counts to distribute load evenly across the pool.
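
The sketch below illustrates the token-aware routing idea: prefill load is estimated from prompt length (compute-bound), decode load from expected output length (memory-bound), and each stage is sent to the least-loaded backend in its own pool. The Backend shape, pool handling, and counters are assumptions for illustration, not Cloudflare's actual load balancer.

```typescript
// Sketch of token-aware routing across disaggregated prefill/decode pools.
// Backend shape, counters, and pool handling are illustrative assumptions,
// not Cloudflare's load-balancer code.

interface Backend {
  url: string;
  inflightTokens: number; // estimated tokens currently being processed
}

// Pick the backend with the lowest estimated in-flight token count.
function leastLoaded(pool: Backend[]): Backend {
  return pool.reduce((a, b) => (a.inflightTokens <= b.inflightTokens ? a : b));
}

function route(
  prefillPool: Backend[],
  decodePool: Backend[],
  promptTokens: number,
  maxOutputTokens: number,
): { prefillUrl: string; decodeUrl: string } {
  // Prefill is compute-bound: load is driven by prompt length.
  const prefill = leastLoaded(prefillPool);
  prefill.inflightTokens += promptTokens;

  // Decode is memory-bound: load is driven by expected output length.
  const decode = leastLoaded(decodePool);
  decode.inflightTokens += maxOutputTokens;

  // Counters are decremented as servers report completed tokens (not shown),
  // which is where rewriting SSE responses to carry token counts comes in.
  return { prefillUrl: prefill.url, decodeUrl: decode.url };
}
```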

For models like Kimi K2.5, which require at least 8 H100 GPUs to hold 560GB of model weights plus the necessary KV cache, Cloudflare utilizes the Mooncake Transfer Engine. This framework uses RDMA to perform direct memory-to-memory data transfers without CPU involvement. When paired with LMCache or SGLang HiCache, the KV cache is shared across the cluster, allowing nodes to reuse cached tensors from previous requests. Furthermore, the Mooncake Store extends the cache beyond GPU VRAM to NVMe storage, increasing how long sessions remain cached and improving overall request throughput.
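
As a rough illustration of the tiering described above, the sketch below checks local GPU memory first and falls back to a cluster-wide, NVMe-backed store; on a hit in the slower tier, the entry is promoted back into GPU memory so later turns of the same session stay fast. The CacheTier interface and promotion policy are assumptions, not the LMCache, SGLang HiCache, or Mooncake Store APIs.

```typescript
// Sketch of a two-tier KV-cache lookup: local GPU memory first, then a
// cluster-wide NVMe-backed store. The CacheTier interface and promotion
// policy are illustrative assumptions.

type KvTensors = ArrayBuffer; // placeholder for serialized KV-cache tensors

interface CacheTier {
  get(promptHash: string): Promise<KvTensors | null>;
  put(promptHash: string, kv: KvTensors): Promise<void>;
}

class TieredKvCache {
  constructor(
    private gpuTier: CacheTier,   // fastest, smallest: GPU VRAM
    private storeTier: CacheTier, // larger, slower: NVMe reached over RDMA
  ) {}

  async lookup(promptHash: string): Promise<KvTensors | null> {
    // Hot path: the prefix is already resident in GPU memory.
    const local = await this.gpuTier.get(promptHash);
    if (local) return local;

    // Cold path: fetch from the shared store and promote into GPU memory so
    // later turns of the same session hit the fast tier.
    const remote = await this.storeTier.get(promptHash);
    if (remote) await this.gpuTier.put(promptHash, remote);
    return remote;
  }
}
```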

Impact / Why It Matters

Developers can achieve significantly lower latency and higher throughput for complex agentic workflows by leveraging the x-session-affinity header for prompt caching. These infrastructure improvements allow for the efficient execution of trillion-parameter models within a distributed global network.
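
A hypothetical client-side example of using session affinity for prompt caching: send the same x-session-affinity value on every turn of an agent session so repeated prefixes land on a backend that already holds their KV cache. The header name comes from the post; the endpoint, model identifier, and payload shape below are placeholders, not the documented Workers AI API.

```typescript
// Illustrative only: the x-session-affinity header comes from the post, but
// the endpoint, model identifier, and payload shape are placeholders.

const sessionId = crypto.randomUUID(); // reuse this value for every turn of the session

async function chatTurn(messages: { role: string; content: string }[]): Promise<Response> {
  return fetch("https://example-inference-gateway/v1/chat/completions", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: "Bearer <API_TOKEN>",
      // Pinning all turns of a session to the same backend lets the KV cache
      // from earlier turns be reused instead of recomputed at prefill time.
      "x-session-affinity": sessionId,
    },
    body: JSON.stringify({ model: "kimi-k2.5", messages, stream: true }),
  });
}
```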

infrastructure LLM AI-inference