Kubernetes v1.35: Introducing Workload-Aware Scheduling
Summary
Kubernetes v1.35 introduces workload-aware scheduling features designed to manage multi-Pod applications, such as machine learning batch jobs, more efficiently. The update implements the new Workload API and gang scheduling to enable all-or-nothing Pod placement, alongside optimizations for scheduling identical Pods.
Key Points
- Introduces the `scheduling.k8s.io/v1alpha1` API group to define structured scheduling requirements for multi-Pod applications.
- Implements gang scheduling via the `GangScheduling` plugin, ensuring a group of Pods is only scheduled if the `minCount` requirement is met.
- Features "opportunistic batching" (Beta) to reduce scheduling latency by reusing feasibility calculations for Pods with identical resource requests, images, and affinities.
- Includes a 5-minute timeout for gang scheduling; if the full group cannot be assigned to nodes within this window, all Pods in the group are rejected and returned to the queue.
- Requires enabling the `GenericWorkload` feature gate on both `kube-apiserver` and `kube-scheduler` to use the Workload API.
- The `OpportunisticBatching` feature is enabled by default in v1.35 but can be managed via the `OpportunisticBatching` feature gate on `kube-scheduler`.
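As a sketch of the rollout steps above, the feature gates could be set like this (standard `--feature-gates` flag syntax; adapt to however you launch the components, e.g. static Pod manifests or a managed control plane):

```shell
# Alpha Workload API: enable GenericWorkload on both components.
kube-apiserver --feature-gates=GenericWorkload=true ...
kube-scheduler --feature-gates=GenericWorkload=true ...

# OpportunisticBatching is on by default in v1.35; to opt out:
kube-scheduler --feature-gates=OpportunisticBatching=false ...
```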
Technical Details
The Workload API allows users to define a Workload resource that specifies a podGroup and a scheduling policy. Pods are linked to this workload using the workloadRef and podGroup fields in their specification. The GangScheduling plugin manages the lifecycle of these Pods by blocking them from scheduling until the referenced Workload object exists and the number of pending Pods in the group meets the defined minCount. Once the minCount is reached, the scheduler uses a "Permit" gate to verify that valid assignments exist for the entire group. If the scheduler cannot find valid assignments for the whole group within five minutes, it rejects all Pods in that group to prevent resource deadlocks and wastage.
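A minimal sketch of the pairing described above, assuming a `Workload` kind in the `scheduling.k8s.io/v1alpha1` group; the exact field layout (`podGroups`, `policy`, the shape of `workloadRef`) is illustrative, not authoritative:

```yaml
# Hypothetical Workload declaring a gang of 4 worker Pods.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 4   # all-or-nothing: schedule only when 4 Pods are pending
---
# A member Pod links back to the Workload and names its group.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
spec:
  workloadRef:
    name: training-job
  podGroup: workers
  containers:
  - name: trainer
    image: example.com/trainer:latest   # placeholder image
```

The `GangScheduling` plugin holds `worker-0` in the queue until the `training-job` object exists and four pending Pods reference the `workers` group.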
To optimize performance, the OpportunisticBatching feature identifies Pods that share identical scheduling criteria, such as container images, resource requests, and affinity rules. When the scheduler processes a Pod, it can reuse the feasibility calculations from previous identical Pods in the queue, significantly speeding up the scheduling process for large, identical workloads. However, this mechanism is disabled if any scheduling-relevant fields differ between Pods or if specific features are used that interfere with the batching logic.
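The reuse check can be sketched in Go. This is a simplified model, not the scheduler's actual implementation: `schedulingKey` stands in for whatever set of scheduling-relevant Pod fields the real plugin compares, and reuse is allowed only when every field is identical.

```go
package main

import (
	"fmt"
	"reflect"
)

// schedulingKey models the Pod fields that must match exactly for
// feasibility results to be reused (simplified for illustration).
type schedulingKey struct {
	Images    []string // container images
	CPUMillis int64    // CPU request in millicores
	MemBytes  int64    // memory request in bytes
	Affinity  string   // serialized affinity rules
}

// canReuseFeasibility reports whether feasibility calculations computed
// for prev can be reused for next: any differing scheduling-relevant
// field disables batching for that pair.
func canReuseFeasibility(prev, next schedulingKey) bool {
	return reflect.DeepEqual(prev, next)
}

func main() {
	a := schedulingKey{
		Images:    []string{"example.com/trainer:latest"},
		CPUMillis: 1000,
		MemBytes:  1 << 30,
		Affinity:  "zone=a",
	}

	b := a // identical Pod in the queue: reuse the previous result
	fmt.Println(canReuseFeasibility(a, b)) // true

	c := a
	c.CPUMillis = 2000 // different resource request: no reuse
	fmt.Println(canReuseFeasibility(a, c)) // false
}
```

The all-fields-equal rule mirrors the behavior described above: batching is a fast path only for truly identical Pods, so a single differing field falls back to a full feasibility computation.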
Impact / Why It Matters
These updates reduce resource deadlocks and wastage in large-scale, multi-Pod workloads by ensuring that interdependent Pods are scheduled as a single unit. Additionally, the introduction of opportunistic batching provides a performance boost for high-scale, identical workloads without requiring manual configuration.