Kubernetes v1.36: Staleness Mitigation and Observability for Controllers
Summary
Kubernetes v1.36 introduces new features designed to mitigate controller staleness and improve the observability of controller behavior. These updates allow controllers to detect when their local cache is outdated relative to recent API server writes, preventing incorrect reconciliation actions caused by an inconsistent view of the cluster state.
Key Points
- AtomicFIFO Feature Gate: A new AtomicFIFO feature gate in client-go enables atomic processing of batch operations, ensuring the queue remains consistent even when events arrive out of order during initial list operations.
- Cache Introspection: The Store interface in client-go now includes the LastStoreSyncResourceVersion() function, allowing clients to determine the latest resource version seen by the controller cache.
- Controller-Manager Updates: The DaemonSet, StatefulSet, ReplicaSet, and Job controllers in kube-controller-manager now utilize staleness mitigation by default.
- Feature Gate Control: Staleness mitigation for specific controllers can be disabled using the StaleControllerConsistency<API type> feature gate (e.g., StaleControllerConsistencyDaemonSet).
- ConsistencyStore Interface: A new ConsistencyStore interface for informer authors provides WroteAt, EnsureReady, and Clear methods to track and verify resource versions.
- New Observability Metrics: New alpha metrics include stale_sync_skips_total to track skipped reconciliations and store_resource_version to monitor the latest resource version of shared informers.
Technical Details
The staleness mitigation mechanism functions by comparing the latest resource version present in the controller's cache against the resource version of the last object the controller successfully wrote to the API server. If the cache's resource version is lower than the version of the last write, the controller identifies the cache as stale and skips the reconciliation loop to avoid acting on outdated information. This prevents the "incorrect action" pattern where a controller might revert a change because it has not yet seen the update in its local cache.
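The comparison described above can be sketched in a few lines of Go. This is a minimal illustration, not the kube-controller-manager implementation: the function name isCacheStale is hypothetical, and the sketch assumes resource versions can be parsed as unsigned decimal integers and compared numerically.

```go
package main

import (
	"fmt"
	"strconv"
)

// isCacheStale is a hypothetical helper (not a client-go function).
// It reports whether the cache's resource version is lower than the
// resource version of the controller's last successful write.
// Assumption: both values parse as unsigned decimal integers.
func isCacheStale(cacheRV, lastWrittenRV string) (bool, error) {
	if lastWrittenRV == "" {
		// The controller has not written anything yet, so there is
		// nothing the cache could be behind on.
		return false, nil
	}
	cache, err := strconv.ParseUint(cacheRV, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse cache resource version %q: %w", cacheRV, err)
	}
	written, err := strconv.ParseUint(lastWrittenRV, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse written resource version %q: %w", lastWrittenRV, err)
	}
	return cache < written, nil
}

func main() {
	const lastWrittenRV = "105" // illustrative value only
	for _, cacheRV := range []string{"100", "105", "110"} {
		stale, err := isCacheStale(cacheRV, lastWrittenRV)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if stale {
			fmt.Printf("cache at %s is behind write %s: skip this reconciliation\n", cacheRV, lastWrittenRV)
		} else {
			fmt.Printf("cache at %s has caught up with write %s: reconcile\n", cacheRV, lastWrittenRV)
		}
	}
}
```

A skipped reconciliation is not lost work in this model: the controller simply tries again once the cache has observed its own write.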
For developers implementing custom informers, the ConsistencyStore interface provides the primitives necessary to implement "read-your-own-writes" semantics. The WroteAt method records the resource version of an object after a write operation, while EnsureReady checks if the cache has reached that specific version. To prevent memory leaks in the consistency store, the Clear method should be used when an object is deleted. Additionally, client-go now emits store_resource_version metrics, which include Group, Version, and Resource labels, allowing operators to compare informer state directly against the API server's state.
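To make the read-your-own-writes flow concrete, here is a simplified in-memory sketch built around the three method names mentioned above. Everything else is an assumption made for illustration: the type name writeTracker, the string keys, the uint64 resource versions, and the method signatures are hypothetical and may differ from the actual ConsistencyStore interface in client-go.

```go
package main

import (
	"fmt"
	"sync"
)

// writeTracker is a hypothetical stand-in for a consistency store.
// It remembers, per object key, the highest resource version this
// controller has written. The signatures below are assumptions and
// are not taken from client-go.
type writeTracker struct {
	mu      sync.Mutex
	written map[string]uint64
}

func newWriteTracker() *writeTracker {
	return &writeTracker{written: make(map[string]uint64)}
}

// WroteAt records the resource version returned by a successful write.
// An older version never overwrites a newer one.
func (t *writeTracker) WroteAt(key string, rv uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if rv > t.written[key] {
		t.written[key] = rv
	}
}

// EnsureReady reports whether a cache at cacheRV has caught up with
// the last recorded write for key. Keys with no recorded write are
// always considered ready.
func (t *writeTracker) EnsureReady(key string, cacheRV uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	rv, ok := t.written[key]
	return !ok || cacheRV >= rv
}

// Clear drops the entry for a deleted object so the map does not grow
// without bound.
func (t *writeTracker) Clear(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.written, key)
}

func main() {
	tracker := newWriteTracker()
	key := "default/web" // illustrative namespace/name key

	tracker.WroteAt(key, 42)
	fmt.Println("ready with cache at 41:", tracker.EnsureReady(key, 41))
	fmt.Println("ready with cache at 42:", tracker.EnsureReady(key, 42))

	tracker.Clear(key) // the object was deleted
	fmt.Println("ready after Clear:", tracker.EnsureReady(key, 0))
}
```

Keeping only the highest written version per key is one possible design choice; it keeps the check cheap and ensures a late, older write cannot make a stale cache look current.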
Impact / Why It Matters
These improvements reduce the frequency of controllers taking incorrect or delayed actions due to cache lag, increasing the overall stability of the cluster. Developers can use the new client-go primitives to build more robust controllers that inherently handle cache inconsistency.