kubernetes: kube-scheduler preemption logic deadlock when it evicts a Pod with a volume to make room for the CSI driver Pod

What happened?

We managed to reproduce and verify the following deadlock that occurs when kube-scheduler evicts a Pod (with lower priority) to make room for a CSI driver Pod.

  1. A new csi-driver-node-disk Pod has to be created for Node worker-1 (caused by an autoscaling event such as a scale-up, or by a Pod lifecycle event). The csi-driver-node-disk Pod has the PriorityClass system-node-critical (priority class value 2000001000).
  2. Node worker-1 is full, so kube-scheduler has to activate its preemption logic (ref https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) to make room for the csi-driver-node-disk Pod. At this point the csi-driver-node-disk Pod is in Pending state.
  3. kube-scheduler evicts a Pod with a volume. However, the corresponding Pod deletion hangs forever in Terminating because there is no running csi-driver-node-disk Pod on the given Node to perform the unmount operations.

There is no automatic recovery from this state: the csi-driver-node-disk Pod stays in Pending forever, and the evicted Pod stays in Terminating forever.
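
For reference, the priority class value mentioned in step 1 can be confirmed with:

$ k get priorityclass system-node-critical -o jsonpath='{.value}'
2000001000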

See below for more detailed steps to reproduce.

What did you expect to happen?

Such a deadlock should not be possible in the system.

How can we reproduce it (as minimally and precisely as possible)?

  1. Prepare a Node that is almost full (cpu or memory). Make sure that all of the running workload on this Node has volumes (a sketch of such a workload is shown after the node description below).
$ k describe no worker-1

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                1887m (98%)       100m (5%)
  memory             1169187818 (39%)  14972Mi (532%)
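
The workload used in this reproduction is a StatefulSet named web (see the Pods listed in step 4). A minimal sketch of such a StatefulSet, where the image, StorageClass, and cpu request value are assumptions and should be adjusted to the cluster, could look like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-1   # pin the workload to the almost-full Node
      containers:
      - name: web
        image: nginx                       # hypothetical image
        resources:
          requests:
            cpu: 600m                      # hypothetical value; tune so the Node ends up ~98% requested
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www                            # matches the volume name in the kubelet log below
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: default            # hypothetical StorageClass backed by the CSI driver
      resources:
        requests:
          storage: 1Gi
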
  2. Increase the csi-driver-node-disk DaemonSet cpu requests high enough to make sure that the Pod won't fit on the Node (simulating a scale-up event by the autoscaler).
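
A hypothetical way to do this (the 2000m value is an assumption; pick anything larger than the remaining allocatable cpu of worker-1; this sets the request on all containers of the DaemonSet, use -c to target a single container):

$ k -n kube-system set resources daemonset csi-driver-node-disk --requests=cpu=2000m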

  3. Make sure that the new csi-driver-node-disk Pod fails to be scheduled with:

$ k -n kube-system describe po csi-driver-node-disk-4m7vz

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  10m                  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  8m56s (x1 over 10m)  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity/selector.

The csi-driver-node-disk Pod has PriorityClass system-node-critical.
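
This can be verified on the Pending Pod itself (Pod name taken from the describe output above):

$ k -n kube-system get po csi-driver-node-disk-4m7vz -o jsonpath='{.spec.priorityClassName} {.spec.priority}'
system-node-critical 2000001000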

  4. Make sure that kube-scheduler tries to evict one of the Pods with lower priority:
$ k get po
NAME    READY   STATUS        RESTARTS   AGE
web-0   1/1     Running       0          21m
web-1   1/1     Running       0          21m
web-2   0/1     Terminating   0          21m
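
After the preemption, the scheduler records the Node it tried to free up in status.nominatedNodeName of the Pending Pod; in this scenario the following should print worker-1:

$ k -n kube-system get po csi-driver-node-disk-4m7vz -o jsonpath='{.status.nominatedNodeName}'
worker-1
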
  5. Make sure that the deletion of the web-2 Pod (which has a volume) hangs forever because there is no running csi-driver-node-disk Pod on this Node to perform the unmount operations.

kubelet log:

E0617 11:33:53.547111    3074 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/disk.csi.azure.com^/subscriptions/<omitted>/resourceGroups/shoot--foo--bar/providers/Microsoft.Compute/disks/pv-shoot--foo--bar-c2561535-78a0-432e-9364-681c36b4d674 podName:ba6a8d0c-6018-436a-a7cb-83cb61341537 nodeName:}" failed. No retries permitted until 2022-06-17 11:35:55.547035965 +0000 UTC m=+13631.726363595 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"www\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/<omitted>/resourceGroups/shoot--foo--bar/providers/Microsoft.Compute/disks/pv-shoot--foo--bar-c2561535-78a0-432e-9364-681c36b4d674\") pod \"ba6a8d0c-6018-436a-a7cb-83cb61341537\" (UID: \"ba6a8d0c-6018-436a-a7cb-83cb61341537\") : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name disk.csi.azure.com not found in the list of registered CSI drivers"
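
The root cause is also visible on the CSINode object for worker-1: because the csi-driver-node-disk Pod never started, disk.csi.azure.com is missing from the list of registered drivers (the exact set of drivers shown depends on the cluster):

$ k get csinode worker-1 -o jsonpath='{.spec.drivers[*].name}'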

Anything else we need to know?

No response

Kubernetes version

I confirmed the issue with v1.21.10, but it should be possible to reproduce it with newer Kubernetes versions as well.

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Gardener

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

For this issue description it is azuredisk-csi-driver, but it should be possible to reproduce the issue with any similar CSI driver.

Most upvoted comments

@ialidzhikov could you try reproducing the issue on 1.27+? It might already be fixed, see #120917 (comment).

I have tested it on 1.27+, and the problem does not occur.