kubernetes: Regression? Deleting Pod marked as terminated while volumes are being unmounted
What happened?
In k8s 1.23:
- a pod is running which mounts some volumes
- stop the csi-driver
- delete the pod using kubectl delete pod xxx
- the pod cannot be deleted completely because its volumes cannot be unmounted
- kube-scheduler cannot execute the deletePodFromCache method to clear the pod from node info: https://github.com/kubernetes/kubernetes/blob/1635c380b26a1d8cc25d36e9feace9797f4bae3c/pkg/scheduler/eventhandlers.go#L223
In k8s 1.27:
- a pod is running which mounts some volumes
- stop the csi-driver
- delete the pod using kubectl delete pod xxx
- the pod cannot be deleted completely because its volumes cannot be unmounted

But kube-scheduler can execute the deletePodFromCache method to clear the pod from node info: https://github.com/kubernetes/kubernetes/blob/bdfb880a198495e94a9092575d160936aa42b824/pkg/scheduler/eventhandlers.go#L215

It looks like kube-scheduler has started to take terminating pods into account. What changes have been made to kube-scheduler between 1.23 and 1.27?
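For reference, here is a minimal conceptual sketch of what "clearing a pod from node info" amounts to on the scheduler side: the pod's resource requests are released from the in-memory cache so new pods can be scheduled onto the node, regardless of whether the API object still exists. The types below are hypothetical and are not the actual code behind deletePodFromCache.

```go
// Conceptual sketch with hypothetical types; not the real scheduler cache.
package main

import "fmt"

type nodeInfo struct {
	requestedMilliCPU int64
	pods              map[string]int64 // pod name -> requested milliCPU
}

// removePod releases the pod's resource requests from the scheduler's view
// of the node, which is roughly what "clearing a pod from node info" does.
func (n *nodeInfo) removePod(name string) {
	if req, ok := n.pods[name]; ok {
		n.requestedMilliCPU -= req
		delete(n.pods, name)
	}
}

func main() {
	n := &nodeInfo{
		requestedMilliCPU: 500,
		pods:              map[string]int64{"xxx": 500},
	}
	fmt.Println("requested before removal:", n.requestedMilliCPU) // 500
	n.removePod("xxx")
	fmt.Println("requested after removal:", n.requestedMilliCPU) // 0
}
```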
What did you expect to happen?
In 1.27, kube-scheduler should not execute deletePodFromCache before the pod is deleted completely.
How can we reproduce it (as minimally and precisely as possible)?
See the reproduction steps above.
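To observe what happens during the repro, a small client-go watcher (a sketch; the pod name xxx and the default namespace are placeholders) can print the pod's phase and deletionTimestamp, which makes it easy to see whether the pod reports a terminal phase while the API object still exists because volumes have not been unmounted.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location; adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch the repro pod (placeholder name/namespace).
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=xxx",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// Prints each status change: event type, pod phase, deletionTimestamp.
		fmt.Printf("event=%s phase=%s deletionTimestamp=%v\n",
			ev.Type, pod.Status.Phase, pod.DeletionTimestamp)
	}
}
```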
Anything else we need to know?
No response
Kubernetes version
1.27
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: open
- Created 9 months ago
- Comments: 29 (22 by maintainers)
@Chaunceyctx I think it would be good to understand what concrete issue you’re running into as a result of this / if there are any downstream effects.
I chatted with @mimowo offline and we looked a bit more closely at this. I believe this is expected behavior and is consistent with the behavior before 1.27.
In 1.27:
- For a pod to enter the terminal phase, SyncTerminatingPod is expected to complete, which is responsible for terminating all running containers. After all containers are terminated, we will generate the terminal phase (Succeeded or Failed). The status_manager explicitly checks that the pod has no running containers before it updates the phase to terminal in the API server.
- For a pod to be deleted by kubelet, the pod must have completed SyncTerminatedPod, which unmounts volumes (and a few other cleanups) - https://github.com/kubernetes/kubernetes/blob/2b4ef19/pkg/kubelet/kubelet.go#L2136-L2149. Only after volumes are unmounted will kl.statusManager.TerminatePod(pod) be called, which in turn calls updateStatusInternal (podIsFinished=true), which is used as the signal that the pod can be deleted.

The same behavior was true in 1.26:
- Terminal phase was gated on all containers being terminal for the pod - https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/status/status_manager.go#L928
- Deletion was blocked based on PodResourcesAreReclaimed, which checks if volumes are unmounted - https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/kubelet_pods.go#L949-L965

As a result, this doesn’t look like a behavior change. The behavior has been that a pod can enter the terminal phase before its volumes are fully detached. However, for a pod to be deleted successfully, the volumes must have been unmounted. I believe this has been done to reduce latency for pods (to ensure we don’t wait for volumes to detach), and because the kubelet does not “own” the volume resources (as mentioned in the comment - https://github.com/kubernetes/kubernetes/issues/120917#issuecomment-1759417824), so this can happen asynchronously after the pod is already in a terminal phase.
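To make the ordering described above easier to follow, here is a conceptual sketch of the two gates: the terminal phase only requires all containers to be stopped, while deletion of the API object additionally requires volumes to be unmounted. The types and functions are hypothetical and only mirror the described flow; they are not the actual kubelet code.

```go
// Conceptual sketch only: hypothetical types mirroring the described ordering.
package main

import "fmt"

type podState struct {
	containersRunning bool
	volumesMounted    bool
	phaseTerminal     bool // Succeeded or Failed
	podIsFinished     bool // the signal set via TerminatePod/updateStatusInternal
}

// syncTerminatingPod stops all containers. Once no containers are running,
// the status manager may publish a terminal phase, even though volumes can
// still be mounted (e.g. the CSI driver is down).
func syncTerminatingPod(p *podState) {
	p.containersRunning = false
	if !p.containersRunning {
		p.phaseTerminal = true
	}
}

// syncTerminatedPod unmounts volumes (and does the remaining cleanups);
// only then is podIsFinished set, which is what gates deletion.
func syncTerminatedPod(p *podState) {
	p.volumesMounted = false
	if !p.volumesMounted {
		p.podIsFinished = true
	}
}

// deletable reports whether the API object can finally be removed.
func deletable(p *podState) bool {
	return p.phaseTerminal && p.podIsFinished
}

func main() {
	p := &podState{containersRunning: true, volumesMounted: true}

	syncTerminatingPod(p)
	// Terminal phase is reached here, but the pod is not yet deletable
	// because volumes are still mounted.
	fmt.Printf("terminal=%v deletable=%v\n", p.phaseTerminal, deletable(p))

	syncTerminatedPod(p)
	fmt.Printf("terminal=%v deletable=%v\n", p.phaseTerminal, deletable(p))
}
```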