kubernetes: kube-scheduler preemption logic deadlock when it evicts a Pod with a volume to make room for the CSI driver Pod
What happened?
We managed to reproduce and verify the following deadlock that occurs when kube-scheduler evicts a lower-priority Pod to make room for a CSI driver Pod.
- A new csi-driver-node-disk Pod has to be created for Node worker-1 (caused by an autoscaling event like a scale-up, or by a Pod lifecycle event). The csi-driver-node-disk Pod has PriorityClass system-node-critical (priority value 2000001000).
- Node worker-1 is full, so kube-scheduler has to activate the preemption logic (ref https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) to make room for the csi-driver-node-disk Pod. At this point the csi-driver-node-disk Pod is in Pending state.
- kube-scheduler evicts a Pod with a volume. The corresponding Pod deletion hangs forever in Terminating because there is no running csi-driver-node-disk Pod on the Node to perform the unmount operations.
There is no automatic recovery from this case -> the csi-driver-node-disk Pod stays forever in Pending and the evicted Pod stays forever in Terminating.
See below for more detailed steps to reproduce.
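For reference, the stuck state can be observed roughly like this (a sketch only; the node name worker-1 and the names csi-driver-node-disk / disk.csi.azure.com are the ones from the reproduction below and will differ in other environments):
$ k -n kube-system get po -o wide | grep csi-driver-node-disk   # the new driver Pod stays Pending
$ k get po -o wide | grep Terminating                           # the evicted Pod stays Terminating
$ k get csinode worker-1 -o jsonpath='{.spec.drivers[*].name}'  # typically no longer lists disk.csi.azure.com once the driver Pod is gone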
What did you expect to happen?
Such a deadlock should not be possible in the system.
How can we reproduce it (as minimally and precisely as possible)?
- Prepare a Node that is almost full (cpu or memory). Make sure that all of the running workload on this Node has volumes:
$ k describe no worker-1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests          Limits
  --------  --------          ------
  cpu       1887m (98%)       100m (5%)
  memory    1169187818 (39%)  14972Mi (532%)
- Increase the csi-driver-node-disk DaemonSet cpu requests high enough to make sure that it won't fit on the Node (to simulate a scale-up event by the autoscaler); a sketch of such a patch is shown after the kubelet log below.
- Make sure that the new csi-driver-node-disk Pod fails to be scheduled with:
$ k -n kube-system describe po csi-driver-node-disk-4m7vz
Events:
Type     Reason            Age                  From               Message
----     ------            ----                 ----               -------
Warning  FailedScheduling  10m                  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity/selector.
Warning  FailedScheduling  8m56s (x1 over 10m)  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity/selector.
The csi-driver-node-disk Pod has PriorityClass system-node-critical.
- Make sure that kube-scheduler tries to evict one of the lower-priority Pods:
$ k get po
NAME    READY   STATUS        RESTARTS   AGE
web-0   1/1     Running       0          21m
web-1   1/1     Running       0          21m
web-2   0/1     Terminating   0          21m
- Make sure the deletion of this web-2 Pod (that has a volume) hangs forever because there is no running csi-driver-node-disk Pod for this Node to perform the unmount operations.
kubelet log:
E0617 11:33:53.547111 3074 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/disk.csi.azure.com^/subscriptions/<omitted>/resourceGroups/shoot--foo--bar/providers/Microsoft.Compute/disks/pv-shoot--foo--bar-c2561535-78a0-432e-9364-681c36b4d674 podName:ba6a8d0c-6018-436a-a7cb-83cb61341537 nodeName:}" failed. No retries permitted until 2022-06-17 11:35:55.547035965 +0000 UTC m=+13631.726363595 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"www\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/<omitted>/resourceGroups/shoot--foo--bar/providers/Microsoft.Compute/disks/pv-shoot--foo--bar-c2561535-78a0-432e-9364-681c36b4d674\") pod \"ba6a8d0c-6018-436a-a7cb-83cb61341537\" (UID: \"ba6a8d0c-6018-436a-a7cb-83cb61341537\") : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name disk.csi.azure.com not found in the list of registered CSI drivers"
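The first two reproduction steps can be simulated roughly as follows (a sketch only; the StatefulSet name web, the DaemonSet name csi-driver-node-disk, the container index 0 and the request value of 2 cpu are assumptions that depend on the concrete environment and driver manifest):
$ k scale statefulset web --replicas=3   # volume-backed workload until the Node is almost full
$ k -n kube-system patch daemonset csi-driver-node-disk --type=json -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "2"}]'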
Anything else we need to know?
No response
Kubernetes version
I confirmed the issue with v1.21.10 but it should be possible to reproduce it with newer K8s versions as well.
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
For this issue description the azuredisk-csi-driver was used, but it should be possible to reproduce the issue with any similar CSI driver.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 28 (24 by maintainers)
I have tested it on 1.27+, and the problem does not occur