★ 6/10 · Infra · 2026-04-20

The AI engineering stack we built internally — on the platform we ship

Summary

Cloudflare has deployed an internal AI engineering stack built entirely on its own production infrastructure, including AI Gateway, Workers AI, and Cloudflare Access. This architecture supports over 3,600 internal users and manages hundreds of billions of tokens for agentic coding workflows, automated code reviews, and security auditing.

Key Points

  • The infrastructure processes 241.37 billion tokens per month through AI Gateway, plus a further 51.83 billion per month via Workers AI.
  • Authentication and zero-trust policy enforcement are handled via Cloudflare Access, which issues signed JWTs for all subsequent provider requests.
  • A centralized proxy Worker, built using the Hono framework, manages LLM routing, cost tracking, and Zero Data Retention (ZDR) controls.
  • The system utilizes a discovery mechanism via a .well-known/opencode endpoint to automatically configure providers, MCP servers, agents, and permissions.
  • Workers AI is used for large-scale inference tasks, such as a security agent that processes 7 billion tokens daily at roughly 77% lower cost than mid-tier proprietary models.
  • The stack implements sandboxed execution for agent-generated code using Dynamic Workers and manages stateful, long-running agent sessions via the Agents SDK (Durable Objects).
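The .well-known/opencode discovery mechanism above can be sketched as a typed client-side check. This is a hypothetical model: the interface fields and the `isDiscoveryDoc` helper are illustrative assumptions, not the published schema.

```typescript
// Hypothetical shape of the .well-known/opencode discovery document;
// field names are illustrative assumptions, not the actual schema.
interface OpencodeDiscovery {
  providers: { id: string; baseUrl: string }[];
  mcp: { name: string; url: string }[];
  agents: string[];
  permissions: Record<string, "allow" | "deny">;
}

// A client could validate the required top-level keys before applying
// the document to its local configuration.
function isDiscoveryDoc(value: unknown): value is OpencodeDiscovery {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    Array.isArray(v.providers) &&
    Array.isArray(v.mcp) &&
    Array.isArray(v.agents) &&
    typeof v.permissions === "object" &&
    v.permissions !== null
  );
}
```

A client would fetch this document once at startup and configure its providers, MCP servers, agents, and permissions from it, rather than shipping per-machine config.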

Technical Details

The architecture relies on a proxy Worker acting as a centralized control plane. When an AI client (such as OpenCode) sends a request, the Worker intercepts it to rewrite headers: it strips the client-side authorization, cf-access-token, and host headers, and injects cf-aig-authorization: Bearer <API_KEY> and cf-aig-metadata: {"userId": "<anonymous-uuid>"}. This design ensures that sensitive provider API keys are never stored on developer machines and that all requests are routed through AI Gateway for governance. To maintain privacy, the Worker maps Cloudflare Access email identities to anonymous UUIDs using D1 for persistent storage and KV for caching, ensuring model providers only see anonymous identifiers.
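The header rewrite described above can be sketched as a pure function. This is a minimal sketch, not Cloudflare's actual Worker code: the `HeaderMap` type and the way the gateway key and anonymous UUID are supplied are assumptions; only the stripped and injected header names come from the text.

```typescript
// Sketch of the proxy Worker's header rewrite. Assumes the caller has
// already resolved the gateway API key and the user's anonymous UUID.
type HeaderMap = Record<string, string>;

// Client-side credentials and routing headers that must never reach the provider.
const STRIPPED = ["authorization", "cf-access-token", "host"];

function rewriteHeaders(
  incoming: HeaderMap,
  gatewayKey: string,
  anonymousUserId: string
): HeaderMap {
  const out: HeaderMap = {};
  for (const [name, value] of Object.entries(incoming)) {
    if (!STRIPPED.includes(name.toLowerCase())) out[name] = value;
  }
  // Inject the gateway credential and anonymous identity metadata.
  out["cf-aig-authorization"] = `Bearer ${gatewayKey}`;
  out["cf-aig-metadata"] = JSON.stringify({ userId: anonymousUserId });
  return out;
}
```

Because the rewrite happens in the Worker, developer machines hold only an Access session, never a provider API key.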

The configuration is managed as code, where agents and commands are authored in Markdown with YAML frontmatter and compiled into a single JSON configuration validated against a specific schema. The system also maintains a dynamic model catalog by using an hourly cron trigger to fetch updated model lists from models.dev, storing them in Workers KV, and injecting store: false to enforce Zero Data Retention. For the Model Context Protocol (MCP) implementation, the stack uses an MCP Server Portal built on Workers and Access, allowing for standardized, authenticated access to internal tools and resources.

Impact / Why It Matters

This architecture provides a blueprint for using a single proxy pattern to implement a scalable, secure, and cost-effective control plane for distributed AI clients. It demonstrates how developers can centralize LLM governance, cost attribution, and identity-based access control without requiring client-side configuration changes.

AI · infrastructure · engineering-case-study