k3s: Node stuck in NotReady status after temporary interface disconnect

Environmental Info:

K3s Version: k3s version v1.22.2+k3s1 (10bca343), go version go1.16.8

Node(s) CPU architecture, OS, and Version:

Linux somehostname1 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname2 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname3 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 node HA cluster with embedded etcd.

NAME            STATUS   ROLES                       AGE     VERSION
somehostname1   Ready    control-plane,etcd,master   5m      v1.22.2+k3s1
somehostname2   Ready    control-plane,etcd,master   4m      v1.22.2+k3s1
somehostname3   Ready    control-plane,etcd,master   3m47s   v1.22.2+k3s1

Nothing is deployed on the cluster except coredns and local-path-provisioner.

Describe the bug: Node is stuck in NotReady status after temporarily disconnecting interface and reconnecting.

Steps To Reproduce: Set up a three-node HA cluster with embedded etcd using k3s version v1.22.2+k3s1.

  1. On one node, bring the interface down: ifdown $interface
  2. Wait until the node goes into NotReady status (determined by watching kubectl get nodes from one of the remaining two nodes).
  3. Bring the interface back up: ifup $interface

The node does not return to Ready, although it appears to be reconnected.
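The steps above can be sketched as a small helper script. Note that eth0 is a placeholder interface name (not from the report), and ifdown/ifup are the network scripts available on the EL7 hosts listed above; nothing runs until a function is called.

```shell
#!/usr/bin/env bash
# Repro sketch for the interface-disconnect test.
# IFACE is a hypothetical default; set it to the node's real interface.
IFACE="${IFACE:-eth0}"

disconnect() {
  # Step 1: on the target node, bring the interface down.
  ifdown "$IFACE"
}

watch_nodes() {
  # Step 2: run on one of the remaining two nodes and wait for the
  # disconnected node to show NotReady.
  kubectl get nodes --watch
}

reconnect() {
  # Step 3: bring the interface back up on the target node.
  ifup "$IFACE"
}
```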

Notes:

  • I’ve tested this multiple times, disconnecting both the etcd leader and non-leader nodes.
  • Physically unplugging the ethernet cable, waiting, and then plugging the cable back in has the same effect.
  • Occasionally it recovers by itself (within a minute), but most of the time it does not.
  • I’ve let it sit for over 12 hours without it recovering.
  • Restarting k3s manually (systemctl restart k3s) does restore the node.
  • etcdctl endpoint status --cluster indicates that the node is back in the etcd cluster after the interface is brought back up.
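The two recovery-related commands from the notes above, wrapped as functions for convenience. This is only a sketch: on a real k3s server, etcdctl typically also needs endpoint and certificate flags, which are omitted here; nothing runs until a function is invoked.

```shell
#!/usr/bin/env bash
# Checks and workaround from the notes above.

etcd_members() {
  # Confirms the node rejoined the etcd cluster after the interface
  # came back up (cert/endpoint flags omitted in this sketch).
  etcdctl endpoint status --cluster
}

restart_k3s() {
  # Manual workaround that restores the node to Ready status.
  systemctl restart k3s
}
```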

kubectl describe node on the “failed” node shows:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
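For scripting the check, the Ready row can be pulled out of the kubectl describe node output with a small helper (ready_status is a hypothetical name, not part of kubectl):

```shell
# ready_status: print the Status column of the Ready condition from
# `kubectl describe node <name>` output supplied on stdin.
ready_status() {
  # In the Conditions table, the Ready row has Type in field 1
  # and Status in field 2.
  awk '$1 == "Ready" { print $2 }'
}
```

While the node is stuck as above, `kubectl describe node somehostname1 | ready_status` prints Unknown.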

Expected behavior: When the interface is reconnected the node should return to Ready status.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Performed the same steps mentioned in https://github.com/k3s-io/k3s/issues/4239#issuecomment-956933405 using master branch commit 702fe24afe3f08f96ebc167313c20c0339a5510f