kubernetes: Kubelet cannot start static pod when its manifest file is moved back in after being removed for a while
What happened?
An error occurs in `managePodLoop` after the static pod has been deleted: when `syncCh` is triggered, the error event is dequeued from the work queue, but the pod manager no longer has the pod because `HandlePodRemoves` has already been called.
A "No such container" error is a common error returned from `p.podCache.GetNewerThan` or `syncTerminatingPodFn` in `managePodLoop`, so this event cannot trigger `HandlePodSyncs` to finish terminating the pod.
Because the pod is still in the terminating state, `SyncLoop ADD` only marks the restartable flag and returns. That leaves housekeeping as the only way to start the pod, but, again because the terminating state never finishes, the pod cannot be added to `restartablePods`, so it cannot be restarted.
Eventually, the pod will never be restarted.
What did you expect to happen?
Static pod can be started or restarted again.
How can we reproduce it (as minimally and precisely as possible)?
We found that the pod would fail to start after the events occurred in the following order:
- `HandlePodRemoves` -> `podManager.DeletePod` -> `managePodLoop` -> error occurred, e.g. "No such container" -> `completeWork` -> `workQueue.Enqueue`
- `syncCh` -> `getPodsToSync` -> `workQueue.GetWork()` -> `podManager.GetPods()` -> pod has been deleted, so `HandlePodSyncs` cannot be entered
- `HandlePodAdditions` -> `podManager.AddPod` -> `status.IsTerminationRequested` -> `status.restartRequested=true` -> return
- `HandlePodCleanups` -> `podWorkers.SyncKnownPods` -> `removeTerminatedWorker` -> `!status.finished` -> `podSyncStatuses` entry not deleted -> `status.terminatedAt.IsZero() && !status.terminatingAt.IsZero()` -> still in TerminatingPod state
Then the pod will never be started, due to the lack of any pod-worker trigger, unless the kubelet is restarted.
Stub method for the error in `managePodLoop`:
- Add a delay between `podSandboxIDs, err := m.getSandboxIDByPodUID(uid, nil)` and `podSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)` in the `for range` loop in `GetPodStatus` in `syncTerminatingPod`.
- Once `getSandboxIDByPodUID` has been called, kill and remove the pause container of the pod manually.
- A "No such container" error occurs.
Kubelet Log:
I1213 17:26:45.869540 19130 kubelet.go:2249] "SyncLoop REMOVE" source="file" pods=[fst-manage/webhook-paas-192-168-2-2]
I1213 17:26:45.869806 19130 kuberuntime_container.go:721] "Killing container with a grace period" pod="fst-manage/webhook-paas-192-168-2-2" podUID=3a91011824b5370050c4bb2c2c0705ad containerName="webhook" containerID="docker://9ff7d0fbdb1f96db833ae273cee55cec2ae73cdd201088d9b6af515c2879d350" gracePeriod=5
I1213 17:26:46.211320 19130 reconciler.go:197] "operationExecutor.UnmountVolume started for volume \"localtime\" (UniqueName: \"kubernetes.io/host-path/3a91011824b5370050c4bb2c2c0705ad-localtime\") pod \"3a91011824b5370050c4bb2c2c0705ad\" (UID: \"3a91011824b5370050c4bb2c2c0705ad\") "
...
I1213 17:26:46.761943 19130 generic.go:159] "GenericPLEG" podUID=3a91011824b5370050c4bb2c2c0705ad containerID="9ff7d0fbdb1f96db833ae273cee55cec2ae73cdd201088d9b6af515c2879d350" oldState=running newState=exited
I1213 17:26:46.761949 19130 generic.go:159] "GenericPLEG" podUID=3a91011824b5370050c4bb2c2c0705ad containerID="5b0e3b74c2a73880c3e984fac3c40c8ef0648c92b3df6c0cbf178d4541af5e2d" oldState=running newState=exited
E1213 17:26:47.292895 19130 kuberuntime_manager.go:1091] "getPodContainerStatuses for pod failed" err="Error: No such container: 740ab753973f7ad32c5ef2c9778fb061469fc1a664710988569a46deb7eea66e" pod="fst-manage/webhook-paas-192-168-2-2"
E1213 17:26:47.292948 19130 pod_workers.go:894] "Error syncing pod, skipping" err="Error: No such container: 740ab753973f7ad32c5ef2c9778fb061469fc1a664710988569a46deb7eea66e" pod="fst-manage/webhook-paas-192-168-2-2" podUID=3a91011824b5370050c4bb2c2c0705ad
I1213 17:26:48.312614 19130 kubelet_volumes.go:161] "Cleaned up orphaned pod volumes dir" podUID=3a91011824b5370050c4bb2c2c0705ad path="/var/lib/kubelet/pods/3a91011824b5370050c4bb2c2c0705ad/volumes"
I1213 17:26:48.712057 19130 kubelet.go:2281] "SyncLoop (PLEG): pod does not exist, ignore irrelevant event" event=&{ID:3a91011824b5370050c4bb2c2c0705ad Type:ContainerDied Data:5b0e3b74c2a73880c3e984fac3c40c8ef0648c92b3df6c0cbf178d4541af5e2d}
I1213 17:27:00.876529 19130 kubelet.go:2239] "SyncLoop ADD" source="file" pods=[fst-manage/webhook-paas-192-168-2-2]
Another occurrence:
I0110 11:56:03.480296 25160 kubelet.go:2249] "SyncLoop REMOVE" source="file" pods=[fst-manage/kube-apiserver-paas-192-168-2-2]
I0110 11:56:03.480433 25160 pod_workers.go:840] "Got non-process pod, start sync pod" pod="fst-manage/kube-apiserver-paas-192-168-2-2"
I0110 11:56:03.480768 25160 kuberuntime_container.go:721] "Killing container with a grace period" pod="fst-manage/kube-apiserver-paas-192-168-2-2" podUID=c6dc565f27a76405ed35d58bc441a407 containerName="kube-apiserver" containerID="docker://524779760c0050c8391419ea54295020f42460802cea89a2f11df8d08cb87f15" gracePeriod=1
I0110 11:56:05.297011 25160 generic.go:159] "GenericPLEG" podUID=c6dc565f27a76405ed35d58bc441a407 containerID="524779760c0050c8391419ea54295020f42460802cea89a2f11df8d08cb87f15" oldState=running newState=exited
I0110 11:56:06.177361 25160 generic.go:338] "Generic (PLEG): container finished" podID=c6dc565f27a76405ed35d58bc441a407 containerID="524779760c0050c8391419ea54295020f42460802cea89a2f11df8d08cb87f15" exitCode=137
I0110 11:56:06.177414 25160 kubelet.go:2281] "SyncLoop (PLEG): pod does not exist, ignore irrelevant event" event=&{ID:c6dc565f27a76405ed35d58bc441a407 Type:ContainerDied Data:524779760c0050c8391419ea54295020f42460802cea89a2f11df8d08cb87f15}
E0110 11:56:07.147581 25160 kuberuntime_manager.go:1091] "getPodContainerStatuses for pod failed" err="Error: No such container: fa6daeb414be6249f3f8002c7c6098bcf888debeebcef108d7c6b2fb5cfea447" pod="fst-manage/kube-apiserver-paas-192-168-2-2"
E0110 11:56:07.147608 25160 kubelet.go:1983] "Unable to read pod status prior to final pod termination" err="Error: No such container: fa6daeb414be6249f3f8002c7c6098bcf888debeebcef108d7c6b2fb5cfea447" pod="fst-manage/kube-apiserver-paas-192-168-2-2" podUID=c6dc565f27a76405ed35d58bc441a407
E0110 11:56:07.147629 25160 pod_workers.go:894] "Error syncing pod, skipping" err="Error: No such container: fa6daeb414be6249f3f8002c7c6098bcf888debeebcef108d7c6b2fb5cfea447" pod="fst-manage/kube-apiserver-paas-192-168-2-2" podUID=c6dc565f27a76405ed35d58bc441a407
I0110 11:56:07.245849 25160 generic.go:159] "GenericPLEG" podUID=c6dc565f27a76405ed35d58bc441a407 containerID="fa6daeb414be6249f3f8002c7c6098bcf888debeebcef108d7c6b2fb5cfea447" oldState=exited newState=non-existent
I0110 11:56:07.245860 25160 generic.go:159] "GenericPLEG" podUID=c6dc565f27a76405ed35d58bc441a407 containerID="349f55d33dc68eddebc3383a251e8048ceaa83cf0b006843b99e168736e61b82" oldState=running newState=exited
I0110 11:56:07.257231 25160 kubelet.go:2281] "SyncLoop (PLEG): pod does not exist, ignore irrelevant event" event=&{ID:c6dc565f27a76405ed35d58bc441a407 Type:ContainerDied Data:349f55d33dc68eddebc3383a251e8048ceaa83cf0b006843b99e168736e61b82}
I0110 11:57:20.569359 25160 kubelet.go:2239] "SyncLoop ADD" source="file" pods=[fst-manage/kube-apiserver-paas-192-168-2-2]
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Kubernetes version: v1.22.1-c5fe23a7fbdd1d996731d6c996ba8aed3804d471
with #105075 already backported
Cloud provider
Local machine
OS version
$ cat /etc/os-release
NAME="EulerOS"
VERSION="2.0 (SP10x86_64)"
ID="euleros"
VERSION_ID="2.0"
PRETTY_NAME="EulerOS 2.0 (SP10x86_64)"
ANSI_COLOR="0;31"
$ uname -a
Linux host-192-168-2-2 4.18.0-147.5.2.5.h732.eulerosv2r10.x86_64 #1 SMP Sat Nov 27 16:33:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (15 by maintainers)
We have found that the issue still exists in 1.22.8; #108189 does not fix it completely.