kubernetes: Graceful Node Shutdown regression in 1.21 → 1.22
What happened:
Shut down a node gracefully, e.g. with systemctl reboot.
What you expected to happen:
Expected the pods to be evicted and new pods to be started on other nodes.
How to reproduce it (as minimally and precisely as possible):
Create a deployment with one pod, configure Graceful Node Shutdown as described in https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/, then run systemctl reboot on that node. The pod ends up in a Terminated or Completed state, but no new pod is created and scheduled on a different node. When the node comes back, the terminated pod remains in that state:
NAME                              READY   STATUS       RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
home-assistant-5dfcb44d74-rg2zs   1/1     Terminated   2          30m   10.112.0.69   k8s-node13   <none>           <none>
Effectively, the pod is running on the node that was gracefully rebooted, but it is reported as Terminated.
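For reference, Graceful Node Shutdown was configured through the kubelet config file as in the linked blog post; a minimal excerpt (the path and durations below are assumptions, not the exact values from this cluster) looks like:

# /var/lib/kubelet/config.yaml (excerpt, values assumed)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s

With the feature active, the kubelet registers a systemd inhibitor lock, which can be checked on the node with systemd-inhibit --list before rebooting; the termination reason the kubelet recorded for the pod can be inspected with kubectl describe pod.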
Anything else we need to know?:
This worked as expected in v1.21.
Environment:
- Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/arm"}
- Cloud provider or hardware configuration: bare metal
- OS (e.g. cat /etc/os-release):
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. uname -a):
# uname -a
Linux k8s-node07 5.10.0-8-armmp #1 SMP Debian 5.10.46-4 (2021-08-03) armv7l GNU/Linux
- Install tools: kubeadm
- Network plugin and version (if this is a network-related bug): kube-router
- Others:
@ehashman @rphillips see upthread: Jordan and I also briefly discussed backporting the config flag that would set the default (we should explore that in the fix). For distros that want to use beta graceful shutdown but are not running transient nodes, we should recommend that they leverage that flag to avoid data loss on pods (if we do decide to include the flag).
I agree with Jordan; that is what we should have done. People using graceful shutdown on non-terminal pods should also be informed of this behavior change: someone deploying 1.22 who is not using transient nodes may be impacted in a way that causes data loss, for instance if they keep data in pod-local dirs and started using graceful deletion in 1.22 unaware of the previous behavior.
Jordan argued that, as a beta feature, changing the default in 1.23->1.24 is acceptable, whereas changing the default in a backport fix is unusual; per our normal handling of regressions, even with enough of a gap that people might anticipate the new behavior, we try to restore the previous behavior in patch versions.
This is a regression in 1.22, caused by https://github.com/kubernetes/kubernetes/commit/3eadd1a9ead7a009a9abfbd603a5efd0560473cc#diff-d0efebc6b30989428cd9bc3f04bb15cd75134e38ad38ce21a648fed69389c7aeL273
The change was intentional, since restarting a node should not necessarily stop all pods/containers on that node (see https://github.com/kubernetes/kubernetes/pull/104798#issuecomment-924038286), but for people depending on the existing behavior, it is definitely a regression.
I think we should do the following:
The plan in https://github.com/kubernetes/kubernetes/issues/104531#issuecomment-982763592 makes sense.
I’ve opened up the following changes to revert to the previous 1.21 behavior.
We will follow up with an option to make it configurable in 1.24, as we discussed.
Hi @mkimuram
It seems that the node status is not synchronized in time:
https://github.com/kubernetes/kubernetes/blob/eb729620c522753bc7ae61fc2c7b7ea19d4aad2f/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L222
Could you remove the go statement there (so the call runs synchronously) and try again to see whether the issue can be avoided?
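To illustrate the suggestion (a simplified sketch only, not the actual kubelet code; the function names below are made up), removing the go keyword turns a fire-and-forget goroutine into a blocking call, so the node status update finishes before the shutdown sequence continues:

package main

import (
	"fmt"
	"time"
)

// updateNodeStatus stands in for the kubelet's node status sync; the name
// and the sleep are illustrative only.
func updateNodeStatus() {
	time.Sleep(100 * time.Millisecond) // pretend this is an API round trip
	fmt.Println("node status updated")
}

func killPods() {
	fmt.Println("terminating pods")
}

func main() {
	// Asynchronous: the goroutine may still be in flight when pod
	// termination starts, leaving the API server with a stale node status.
	go updateNodeStatus()
	killPods()

	time.Sleep(200 * time.Millisecond)

	// Synchronous (the suggestion, i.e. without "go"): the status update
	// completes before pods are terminated.
	updateNodeStatus()
	killPods()
}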
/sig node
/cc @bobbypage @wzshiming