kubernetes: Regression? Deleting Pod marked as terminated while volumes are being unmounted
What happened?
In k8s 1.23:
- a pod is running which mounts some volumes
- stop the csi-driver
- delete the pod using kubectl delete pod xxx
- the pod cannot be deleted completely because its volumes cannot be unmounted
- kube-scheduler cannot execute the deletePodFromCache method to clear the pod from node info: https://github.com/kubernetes/kubernetes/blob/1635c380b26a1d8cc25d36e9feace9797f4bae3c/pkg/scheduler/eventhandlers.go#L223
In k8s 1.27:
- a pod is running which mounts some volumes
- stop the csi-driver
- delete the pod using kubectl delete pod xxx
- the pod cannot be deleted completely because its volumes cannot be unmounted

But kube-scheduler can execute the deletePodFromCache method to clear the pod from node info: https://github.com/kubernetes/kubernetes/blob/bdfb880a198495e94a9092575d160936aa42b824/pkg/scheduler/eventhandlers.go#L215

It looks like kube-scheduler has started to take terminating pods into account. What changes have been made to kube-scheduler between 1.23 and 1.27?
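For reference, here is a minimal conceptual sketch of what "clearing a pod from node info" amounts to on the scheduler side: the pod's resource requests are released from the in-memory cache so new pods can be scheduled onto the node, regardless of whether the API object still exists. The types below are hypothetical and are not the actual code behind deletePodFromCache.

```go
// Conceptual sketch with hypothetical types; not the real scheduler cache.
package main

import "fmt"

type nodeInfo struct {
	requestedMilliCPU int64
	pods              map[string]int64 // pod name -> requested milliCPU
}

// removePod releases the pod's resource requests from the scheduler's view
// of the node, which is roughly what "clearing a pod from node info" does.
func (n *nodeInfo) removePod(name string) {
	if req, ok := n.pods[name]; ok {
		n.requestedMilliCPU -= req
		delete(n.pods, name)
	}
}

func main() {
	n := &nodeInfo{
		requestedMilliCPU: 500,
		pods:              map[string]int64{"xxx": 500},
	}
	fmt.Println("requested before removal:", n.requestedMilliCPU) // 500
	n.removePod("xxx")
	fmt.Println("requested after removal:", n.requestedMilliCPU) // 0
}
```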
What did you expect to happen?
In 1.27, kube-scheduler should not execute deletePodFromCache before the pod is deleted completely.
How can we reproduce it (as minimally and precisely as possible)?
See the reproduction steps above.
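To observe what happens during the repro, a small client-go watcher (a sketch; the pod name xxx and the default namespace are placeholders) can print the pod's phase and deletionTimestamp, which makes it easy to see whether the pod reports a terminal phase while the API object still exists because volumes have not been unmounted.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location; adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch the repro pod (placeholder name/namespace).
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=xxx",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// Prints each status change: event type, pod phase, deletionTimestamp.
		fmt.Printf("event=%s phase=%s deletionTimestamp=%v\n",
			ev.Type, pod.Status.Phase, pod.DeletionTimestamp)
	}
}
```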
Anything else we need to know?
No response
Kubernetes version
1.27
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: open
- Created 9 months ago
- Comments: 29 (22 by maintainers)
@Chaunceyctx I think it would be good to understand what concrete issue you’re running into as a result of this / if there are any downstream effects.
I chatted with @mimowo offline and we looked a bit more closely at this. I believe this is expected behavior and is consistent with the behavior before 1.27.
In 1.27:
- For a pod to enter the terminal phase, SyncTerminatingPod is expected to complete, which is responsible for terminating all running containers. After all containers are terminated, we will generate the terminal phase (Succeeded or Failed). The status_manager explicitly checks that the pod has no running containers before it updates the phase to terminal in the API server.
- For a pod to be deleted by kubelet, the pod must have completed SyncTerminatedPod, which unmounts volumes (and a few other cleanups) - https://github.com/kubernetes/kubernetes/blob/2b4ef19/pkg/kubelet/kubelet.go#L2136-L2149. Only after volumes are unmounted will kl.statusManager.TerminatePod(pod) be called, which in turn calls updateStatusInternal (podIsFinished=true), which is used as the signal that the pod can be deleted.

The same behavior was true in 1.26:
- Terminal phase was gated on all containers being terminal for the pod - https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/status/status_manager.go#L928
- Deletion was blocked based on PodResourcesAreReclaimed, which checks if volumes are unmounted - https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/kubelet_pods.go#L949-L965

As a result, this doesn’t look like a behavior change. The behavior has been that a pod can enter the terminal phase before its volumes are fully detached. However, for a pod to be deleted successfully, the volumes must have been unmounted. I believe this has been done to reduce latency for pods (to ensure we don’t wait for volumes to detach), and because the kubelet does not “own” the volume resources (as mentioned in the comment - https://github.com/kubernetes/kubernetes/issues/120917#issuecomment-1759417824), so this can happen asynchronously after the pod is already in a terminal phase.
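To make the ordering described above easier to follow, here is a conceptual sketch of the two gates: the terminal phase only requires all containers to be stopped, while deletion of the API object additionally requires volumes to be unmounted. The types and functions are hypothetical and only mirror the described flow; they are not the actual kubelet code.

```go
// Conceptual sketch only: hypothetical types mirroring the described ordering.
package main

import "fmt"

type podState struct {
	containersRunning bool
	volumesMounted    bool
	phaseTerminal     bool // Succeeded or Failed
	podIsFinished     bool // the signal set via TerminatePod/updateStatusInternal
}

// syncTerminatingPod stops all containers. Once no containers are running,
// the status manager may publish a terminal phase, even though volumes can
// still be mounted (e.g. the CSI driver is down).
func syncTerminatingPod(p *podState) {
	p.containersRunning = false
	if !p.containersRunning {
		p.phaseTerminal = true
	}
}

// syncTerminatedPod unmounts volumes (and does the remaining cleanups);
// only then is podIsFinished set, which is what gates deletion.
func syncTerminatedPod(p *podState) {
	p.volumesMounted = false
	if !p.volumesMounted {
		p.podIsFinished = true
	}
}

// deletable reports whether the API object can finally be removed.
func deletable(p *podState) bool {
	return p.phaseTerminal && p.podIsFinished
}

func main() {
	p := &podState{containersRunning: true, volumesMounted: true}

	syncTerminatingPod(p)
	// Terminal phase is reached here, but the pod is not yet deletable
	// because volumes are still mounted.
	fmt.Printf("terminal=%v deletable=%v\n", p.phaseTerminal, deletable(p))

	syncTerminatedPod(p)
	fmt.Printf("terminal=%v deletable=%v\n", p.phaseTerminal, deletable(p))
}
```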