kubernetes: Graceful Node Shutdown regression in 1.21 → 1.22
What happened:
Shut down a node gracefully, e.g. with systemctl reboot.
What you expected to happen:
Expected the pods to be evicted and new pods to be started on other nodes.
How to reproduce it (as minimally and precisely as possible):
Create a deployment with one pod, configure Graceful Node Shutdown as described in https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/, then run systemctl reboot on that node. The pod ends up in a Terminated or Completed state, but no new pod is created and scheduled on a different node. When the node comes back, the terminated pod remains in that state:
NAME                              READY   STATUS       RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
home-assistant-5dfcb44d74-rg2zs   1/1     Terminated   2          30m   10.112.0.69   k8s-node13   <none>           <none>
Effectively, the pod is running on the node that was gracefully rebooted, but it is reported as Terminated.
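For reference, Graceful Node Shutdown was configured through the kubelet config file as in the linked blog post; a minimal excerpt (the path and durations below are assumptions, not the exact values from this cluster) looks like:

# /var/lib/kubelet/config.yaml (excerpt, values assumed)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s

With the feature active, the kubelet registers a systemd inhibitor lock, which can be checked on the node with systemd-inhibit --list before rebooting; the termination reason the kubelet recorded for the pod can be inspected with kubectl describe pod.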
Anything else we need to know?:
This worked as expected in v1.21.
Environment:
- Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/arm"}
- Cloud provider or hardware configuration: bare metal
- OS (e.g. cat /etc/os-release):
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. uname -a):
# uname -a
Linux k8s-node07 5.10.0-8-armmp #1 SMP Debian 5.10.46-4 (2021-08-03) armv7l GNU/Linux
- Install tools: kubeadm
- Network plugin and version (if this is a network-related bug): kube-router
- Others:
@ehashman @rphillips see upthread: Jordan and I also briefly discussed backporting the config flag that would set the default (we should explore that in the fix). For distros that want to use beta graceful shutdown but are not running transient nodes, we should recommend that they leverage that flag to avoid data loss on pods (if we do decide to include the flag).
I agree with Jordan; that is what we should have done. People using graceful shutdown on non-terminal pods should also be informed of this behavior change: someone deploying 1.22 who is not using transient nodes may be impacted in a way that causes data loss, for instance if they keep data in pod-local dirs and started using graceful deletion in 1.22 unaware of the previous behavior.
Jordan argued that, as a beta feature, changing the default in 1.23->1.24 is acceptable, whereas changing the default in a backport fix is unusual; per our normal handling of regressions, even with enough of a gap that people might anticipate the new behavior, we try to restore the previous behavior in patch versions.
This is a regression in 1.22, caused by https://github.com/kubernetes/kubernetes/commit/3eadd1a9ead7a009a9abfbd603a5efd0560473cc#diff-d0efebc6b30989428cd9bc3f04bb15cd75134e38ad38ce21a648fed69389c7aeL273
The change was intentional, since restarting a node should not necessarily stop all pods/containers on that node (see https://github.com/kubernetes/kubernetes/pull/104798#issuecomment-924038286), but for people depending on the existing behavior, it is definitely a regression.
I think we should do the following:
The plan in https://github.com/kubernetes/kubernetes/issues/104531#issuecomment-982763592 makes sense.
I’ve opened up the following changes to revert to the previous 1.21 behavior.
We will follow up with an option to make it configurable in 1.24, as we discussed.
Hi @mkimuram
It seems that the node status is not synchronized in time:
https://github.com/kubernetes/kubernetes/blob/eb729620c522753bc7ae61fc2c7b7ea19d4aad2f/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L222
Could you remove the go statement there (so the call runs synchronously) and try again to see whether the issue can be avoided?
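To illustrate the suggestion (a simplified sketch only, not the actual kubelet code; the function names below are made up), removing the go keyword turns a fire-and-forget goroutine into a blocking call, so the node status update finishes before the shutdown sequence continues:

package main

import (
	"fmt"
	"time"
)

// updateNodeStatus stands in for the kubelet's node status sync; the name
// and the sleep are illustrative only.
func updateNodeStatus() {
	time.Sleep(100 * time.Millisecond) // pretend this is an API round trip
	fmt.Println("node status updated")
}

func killPods() {
	fmt.Println("terminating pods")
}

func main() {
	// Asynchronous: the goroutine may still be in flight when pod
	// termination starts, leaving the API server with a stale node status.
	go updateNodeStatus()
	killPods()

	time.Sleep(200 * time.Millisecond)

	// Synchronous (the suggestion, i.e. without "go"): the status update
	// completes before pods are terminated.
	updateNodeStatus()
	killPods()
}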
/sig node
/cc @bobbypage @wzshiming