k3s: Node stuck in NotReady status after temporary interface disconnect

Environmental Info:

K3s Version: k3s version v1.22.2+k3s1 (10bca343), go version go1.16.8

Node(s) CPU architecture, OS, and Version:

Linux somehostname1 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname2 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname3 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 node HA cluster with embedded etcd.

NAME            STATUS   ROLES                       AGE     VERSION
somehostname1   Ready    control-plane,etcd,master   5m      v1.22.2+k3s1
somehostname2   Ready    control-plane,etcd,master   4m      v1.22.2+k3s1
somehostname3   Ready    control-plane,etcd,master   3m47s   v1.22.2+k3s1

Nothing is deployed on the cluster except coredns and local-path-provisioner.

Describe the bug: Node is stuck in NotReady status after temporarily disconnecting interface and reconnecting.

Steps To Reproduce: Set up a three-node HA cluster with embedded etcd using k3s version v1.22.2+k3s1.

  1. On one node, bring the interface down: ifdown $interface
  2. Wait until the node goes into NotReady status (determined by watching kubectl get nodes from one of the remaining two nodes).
  3. Bring the interface back up: ifup $interface

The node does not return to Ready, although it appears to be reconnected.
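The steps above can be sketched as a small helper script. Note that eth0 is a placeholder interface name (not from the report), and ifdown/ifup are the network scripts available on the EL7 hosts listed above; nothing runs until a function is called.

```shell
#!/usr/bin/env bash
# Repro sketch for the interface-disconnect test.
# IFACE is a hypothetical default; set it to the node's real interface.
IFACE="${IFACE:-eth0}"

disconnect() {
  # Step 1: on the target node, bring the interface down.
  ifdown "$IFACE"
}

watch_nodes() {
  # Step 2: run on one of the remaining two nodes and wait for the
  # disconnected node to show NotReady.
  kubectl get nodes --watch
}

reconnect() {
  # Step 3: bring the interface back up on the target node.
  ifup "$IFACE"
}
```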

Notes:

  • I’ve tested this multiple times, disconnecting both the etcd leader and non-leader nodes.
  • Physically unplugging the ethernet cable, waiting, and then plugging the cable back in has the same effect.
  • Occasionally it recovers by itself (within a minute), but most of the time it does not.
  • I’ve let it sit for over 12 hours without it recovering.
  • Restarting k3s manually (systemctl restart k3s) does restore the node.
  • etcdctl endpoint status --cluster indicates that the node is back in the etcd cluster after the interface is brought back up.
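The two recovery-related commands from the notes above, wrapped as functions for convenience. This is only a sketch: on a real k3s server, etcdctl typically also needs endpoint and certificate flags, which are omitted here; nothing runs until a function is invoked.

```shell
#!/usr/bin/env bash
# Checks and workaround from the notes above.

etcd_members() {
  # Confirms the node rejoined the etcd cluster after the interface
  # came back up (cert/endpoint flags omitted in this sketch).
  etcdctl endpoint status --cluster
}

restart_k3s() {
  # Manual workaround that restores the node to Ready status.
  systemctl restart k3s
}
```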

kubectl describe node on the “failed” node shows:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
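For scripting the check, the Ready row can be pulled out of the kubectl describe node output with a small helper (ready_status is a hypothetical name, not part of kubectl):

```shell
# ready_status: print the Status column of the Ready condition from
# `kubectl describe node <name>` output supplied on stdin.
ready_status() {
  # In the Conditions table, the Ready row has Type in field 1
  # and Status in field 2.
  awk '$1 == "Ready" { print $2 }'
}
```

While the node is stuck as above, `kubectl describe node somehostname1 | ready_status` prints Unknown.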

Expected behavior: When the interface is reconnected the node should return to Ready status.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Performed the same steps mentioned in https://github.com/k3s-io/k3s/issues/4239#issuecomment-956933405 using master branch commit 702fe24afe3f08f96ebc167313c20c0339a5510f