kubernetes: kubelet is not able to delete pod with mounted secret/configmap after restart

From https://github.com/kubernetes/kubernetes/issues/96038#issuecomment-728928671

What happened: In https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1328563866136743936, one of the nodes (gce-scale-cluster-minion-group-bddx) restarted for some reason (apparently a kernel panic).

The last kubelet log entry is at 08:13:05.362859 and the first one after restart is 08:15:40.361826.

In the meantime (at 08:13:17.496033), one of the pods (small-deployment-167-56c965c4cf-9pw8k) running on that kubelet was deleted by the generic-garbage-collector (i.e. its deletionTimestamp was set).

After the restart, the kubelet was never able to mark this pod as deleted (i.e. the object was never actually deleted).

What you expected to happen: After the kubelet's restart, the pod object should be deleted from etcd.

How to reproduce it (as minimally and precisely as possible): Based on our logs, stopping the kubelet for a while, deleting a pod running on it, and then restarting the kubelet should trigger this issue.

Anything else we need to know?: It may be important that the pod was using a configmap that had already been deleted.
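
For completeness, here is a rough, untested sketch of those reproduction steps using client-go (the kubeconfig path, namespace, pod, and configmap names are placeholders, and stopping/restarting the kubelet is assumed to happen out of band, e.g. over SSH on the node):

```go
// Sketch of the reproduction steps described above. Names are placeholders;
// stopping/starting the kubelet on the node must be done out of band (e.g. via SSH).
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	ns, pod, cm := "test-ns", "test-pod", "test-configmap" // placeholders

	// 1. (manual) Stop the kubelet on the node running the pod:
	//    ssh <node> sudo systemctl stop kubelet

	// 2. Delete the configmap the pod mounts, then delete the pod itself.
	if err := client.CoreV1().ConfigMaps(ns).Delete(ctx, cm, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	if err := client.CoreV1().Pods(ns).Delete(ctx, pod, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}

	// 3. (manual) Restart the kubelet:
	//    ssh <node> sudo systemctl start kubelet

	// 4. Observe that the pod object keeps its deletionTimestamp but is never
	//    removed from the API server (stuck in Terminating).
	for {
		p, err := client.CoreV1().Pods(ns).Get(ctx, pod, metav1.GetOptions{})
		if err != nil {
			fmt.Println("pod finally gone:", err)
			return
		}
		fmt.Printf("pod still present, deletionTimestamp=%v\n", p.DeletionTimestamp)
		time.Sleep(10 * time.Second)
	}
}
```

If the bug reproduces, the final loop keeps printing a non-nil deletionTimestamp indefinitely after the kubelet comes back up.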

  • Link to the test run: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1328563866136743936
  • Kubelet’s logs: http://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1328563866136743936/artifacts/gce-scale-cluster-minion-group-bddx/kubelet.log
  • Pod name: test-sk0eco-5/small-deployment-167-56c965c4cf-9pw8k
  • More logs (like all kube-apiserver logs for that pod) can be found here: https://github.com/kubernetes/kubernetes/issues/96038#issuecomment-728939343

Environment:

  • Kubernetes version (use kubectl version): v1.20.0-beta.1.663+147a120948482e
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/cc @kubernetes/sig-node-bugs

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (17 by maintainers)

Most upvoted comments

I was taking a look at reproducing and fixing this issue and wanted to post my findings. Basically what is happening is:

  1. Pod gets created.
  2. While the pod is running, the secret is deleted.
  3. Kubelet gets restarted.
  4. Kubelet does reconstruction and the volume gets added to DSOW. It would have been added to DSOW anyway, because the pod was still running. The key thing is that the volume does not get added to ASOW during reconciliation, because the secret/configmap is not found.
  5. Pod is deleted.
  6. Now the code at https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L249 prevents the volume+pod from being removed from DSOW, because the pod never made it to ASOW (see the toy model after this list).
  7. So the pod_worker tries to terminate the container and pod, and most of it works fine. But the kubelet is unable to fully terminate the pod because https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_volumes.go#L77 returns true.
  8. And hence the pod is stuck in the Terminating state.
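
To make steps 4–7 easier to follow, here is a toy model of the DSOW/ASOW bookkeeping. This is not the real volumemanager code; the two maps and the final check are only stand-ins for the linked populator guard and the "volumes still exist" check:

```go
// Toy model of the DSOW/ASOW interaction described above; it is NOT the real
// volumemanager code, just an illustration of why the pod gets stuck.
package main

import "fmt"

type state struct {
	dsow map[string]bool // pod/volume pairs the kubelet wants handled (desired state)
	asow map[string]bool // pod/volume pairs the kubelet knows are mounted (actual state)
}

func main() {
	s := state{dsow: map[string]bool{}, asow: map[string]bool{}}
	key := "small-deployment-167-56c965c4cf-9pw8k/configmap-volume"

	// Step 4: reconstruction after restart adds the volume to DSOW, but the
	// reconcile cannot add it to ASOW because the configmap no longer exists.
	s.dsow[key] = true

	// Step 5: the pod is deleted (deletionTimestamp set).

	// Step 6: the populator only removes a deleted pod's volume from DSOW once
	// it shows up in ASOW, so the entry stays in DSOW forever.
	if s.asow[key] {
		delete(s.dsow, key)
	}

	// Step 7: in this toy model, the stuck DSOW entry stands in for the check
	// that keeps returning true and blocks full pod termination.
	podVolumesExist := len(s.dsow) > 0
	fmt.Println("pod can be fully terminated:", !podVolumesExist) // false -> stuck Terminating
}
```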

The problem is that we simply can’t choose to skip adding volumes to DSOW if the pod has a deletionTimestamp, because that would result in the volume never getting cleaned up. So the fix proposed in https://github.com/kubernetes/kubernetes/pull/96790 is not foolproof.

A real solution, IMO, is to add all pods+volumes in an uncertain state during reconstruction, so that volumes can be removed from DSOW while still being required to be cleaned up before the pod can be terminated. @jsafrane has a PR that implements part of this solution: https://github.com/kubernetes/kubernetes/pull/108180. I am looking into using it to fix this bug. A simplified sketch of the idea follows.
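
A very rough sketch of what the "uncertain" bookkeeping could look like (purely illustrative; the names and types below are made up, and the actual implementation lives in the linked PR, not here):

```go
// Illustrative-only sketch of the "uncertain" idea described above; the names
// and types here are made up, not the real volumemanager API.
package main

import "fmt"

type mountState int

const (
	mountStateCertain mountState = iota
	mountStateUncertain
)

type asowEntry struct {
	podUID string
	volume string
	state  mountState
}

// reconstructVolume is a made-up helper: instead of skipping the volume when
// the backing secret/configmap is gone, it records it in ASOW as uncertain.
func reconstructVolume(asow map[string]asowEntry, podUID, volume string, backingObjectFound bool) {
	st := mountStateCertain
	if !backingObjectFound {
		st = mountStateUncertain
	}
	asow[podUID+"/"+volume] = asowEntry{podUID: podUID, volume: volume, state: st}
}

func main() {
	asow := map[string]asowEntry{}

	// Reconstruction after restart: the configmap backing the volume is gone,
	// so the volume is recorded as uncertain rather than being left out of ASOW.
	reconstructVolume(asow, "small-deployment-167-56c965c4cf-9pw8k", "configmap-volume", false)

	// Because the pod+volume now exists in ASOW, the populator can drop the
	// deleted pod from DSOW, while the reconciler still sees an (uncertain)
	// mount that must be torn down before the pod fully terminates.
	for k, e := range asow {
		fmt.Printf("%s: uncertain=%v\n", k, e.state == mountStateUncertain)
	}
}
```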