kubernetes: Static pods cannot be started any more after kubelet is restarted
What happened?
We are deploying Kubernetes v1.22.8. During cluster deployment the control plane components are started by the kubelet as static pods, but after the static pod manifest files are changed and the kubelet is restarted, the control plane components no longer start. The component containers have been killed, but the sandbox containers are still alive. The kubelet log shows:
Apr 15 11:55:17 w71-qiurui-1330-12b kubelet[13249]: I0415 11:55:17.484952 13249 pod_workers.go:1213] "Pod worker has been requested for removal but is still not fully terminated" podUID=5bfc554f3aad946fbf22edab693ba6d3
Apr 15 11:55:17 w71-qiurui-1330-12b kubelet[13249]: I0415 11:55:17.484966 13249 pod_workers.go:1213] "Pod worker has been requested for removal but is still not fully terminated" podUID=e43c8c031472539371a5cbfce25d5a0d
Apr 15 11:55:17 w71-qiurui-1330-12b kubelet[13249]: I0415 11:55:17.484978 13249 pod_workers.go:1213] "Pod worker has been requested for removal but is still not fully terminated" podUID=6cb9b1d380102bffaa399a5147810cdd
Apr 15 11:55:17 w71-qiurui-1330-12b kubelet[13249]: I0415 11:55:17.484991 13249 pod_workers.go:1213] "Pod worker has been requested for removal but is still not fully terminated" podUID=5cdc8bc8da1745e6e46c7a1be804aa5c
Starting the new static pods is blocked; only after the kubelet is restarted again can the new pods start successfully.
v1.22.8 contains #106394 to fix the pod restart issue, but it does not seem to fix it completely. We did not see this issue on v1.22.3.
What did you expect to happen?
The sandbox containers of the old static pods should be killed, and the new pods should be started.
How can we reproduce it (as minimally and precisely as possible)?
It is hard to reproduce with a standard Kubernetes deployment; it can only be reproduced in our private environment with our own deployment tools.
Possible reproduction steps:
- Install Docker and the kubelet.
- Create the Kubernetes control plane YAML files in /etc/kubernetes/manifests (a simplified example manifest is sketched after this list).
- Modify the control plane YAML files, e.g. by adding annotations.
- Restart the kubelet.
- Repeat step 3 a couple of times.
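For illustration, here is a minimal sketch of the kind of manifest and change involved in steps 2 and 3. The file name mirrors a real control plane component, but the image tag, annotation key, and value are placeholders, not our actual configuration:

# /etc/kubernetes/manifests/kube-apiserver.yaml -- simplified, hypothetical example
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
  annotations:
    # Step 3: adding or bumping a placeholder annotation like this and then
    # restarting the kubelet is the kind of manifest change that triggers
    # the problem for us.
    example.com/deploy-revision: "2"
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.22.8
    command:
    - kube-apiserver
    # real apiserver flags omitted for brevity

After step 4 (for example, systemctl restart kubelet), the old component container is killed, its sandbox container stays alive, and the replacement static pod does not start until the kubelet is restarted once more.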
Anything else we need to know?
The kubelet log is attached as kubelet.tar.gz.
Kubernetes version
1.22.8
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
$ uname -a
Linux global-0 3.10.0-1160.53.1.el7.x86_64 #1 SMP Fri Jan 14 13:59:45 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 2
- Comments: 20 (13 by maintainers)
The problem is similar to #105543: when we don’t have apiserver pod info, the kubelet behaves incorrectly for static pods (which shouldn’t be impacted by the apiserver). I am starting to believe the “all sources ready” check in both housekeeping and delete pod is not sufficient: housekeeping needs to be able to detect whether a specific pod is truly missing or not, and that depends on knowing the source of the pod. I added some comments in https://github.com/kubernetes/kubernetes/pull/111901#issuecomment-1240963051 around something that could potentially address both issues.
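A toy sketch of the gating difference being described (an illustrative model only, not kubelet source; every identifier below is made up):

package main

import "fmt"

// Sources a kubelet receives pod configs from (illustrative model only).
type source string

const (
	fileSource      source = "file" // static pod manifests under /etc/kubernetes/manifests
	apiserverSource source = "api"  // pods received from the apiserver
)

type pod struct {
	name string
	src  source
}

// allSourcesReady models the current global gate: housekeeping (and thus
// cleanup of terminated static pods) only proceeds once every source has
// synced, even though a file-sourced static pod does not depend on the
// apiserver at all.
func allSourcesReady(ready map[source]bool) bool {
	for _, ok := range ready {
		if !ok {
			return false
		}
	}
	return true
}

// perPodSourceReady models the alternative argued for above: decide per pod,
// based on that pod's own source, whether it can be considered truly missing
// and cleaned up.
func perPodSourceReady(p pod, ready map[source]bool) bool {
	return ready[p.src]
}

func main() {
	// The apiserver source has not synced yet (e.g. the apiserver static pod
	// itself is the one being restarted).
	ready := map[source]bool{fileSource: true, apiserverSource: false}
	staticPod := pod{name: "kube-apiserver-node1", src: fileSource}

	fmt.Println("global gate lets housekeeping run:", allSourcesReady(ready))              // false
	fmt.Println("per-pod check would allow cleanup:", perPodSourceReady(staticPod, ready)) // true
}

The point of the model is that a per-pod, source-aware check would let a file-sourced static pod be cleaned up and replaced even while the apiserver source is still unsynced, whereas the global gate blocks all cleanup.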
The conditions of the case are as follows:
The general logs are as follows: