k3s: Node stuck in NotReady status after temporary interface disconnect
Environmental Info:
K3s Version: k3s version v1.22.2+k3s1 (10bca343) go version go1.16.8
Node(s) CPU architecture, OS, and Version:
Linux somehostname1 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname2 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux somehostname3 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 3 node HA cluster with embedded etcd.
NAME            STATUS   ROLES                       AGE     VERSION
somehostname1   Ready    control-plane,etcd,master   5m      v1.22.2+k3s1
somehostname2   Ready    control-plane,etcd,master   4m      v1.22.2+k3s1
somehostname3   Ready    control-plane,etcd,master   3m47s   v1.22.2+k3s1
Nothing is deployed on the cluster except coredns and local-path-provisioner.
Describe the bug: A node is stuck in NotReady status after its network interface is temporarily disconnected and then reconnected.
Steps To Reproduce:
- Set up a three-node HA cluster with embedded etcd using k3s version v1.22.2+k3s1.
- On one node, bring the interface down:
ifdown $interface
- Wait until the node goes into NotReady status, as determined by watching kubectl get nodes from one of the remaining two nodes.
- Bring the interface back up:
ifup $interface
The node does not return to Ready, although it appears to be reconnected.
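The waiting steps above can be scripted rather than watched by hand. The following is a minimal sketch, assuming kubectl is run from one of the remaining nodes; the function names node_status and wait_for_status, the 5-second poll interval, and the default timeout are illustrative choices, not part of the original report.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction steps; names and timings are assumptions.

# Print the STATUS column for node $1, reading `kubectl get nodes` output on stdin.
node_status() {
    awk -v node="$1" '$1 == node { print $2 }'
}

# Poll until node $1 reports status $2; give up after $3 seconds (default 300).
# Returns 0 if the status was seen, 1 on timeout.
wait_for_status() {
    node=$1
    want=$2
    timeout=${3:-300}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        status=$(kubectl get nodes --no-headers | node_status "$node")
        if [ "$status" = "$want" ]; then
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    return 1
}
```

After running ifdown on the affected node, wait_for_status somehostname1 NotReady could be called from a remaining node, and after ifup, wait_for_status somehostname1 Ready 600 would report whether the node recovered within ten minutes.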
Notes:
- I’ve tested this multiple times, disconnecting both etcd leader and non-leader nodes.
- Physically unplugging the ethernet cable, waiting, and then plugging the cable back in has the same effect.
- Occasionally it does recover by itself (within a minute). However, most of the time it does not.
- I’ve let it sit for over 12 hours without it recovering.
- Restarting k3s manually (systemctl restart k3s) does restore the node.
- etcdctl endpoint status --cluster indicates that the node is back in the etcd cluster after the interface is brought back up.
kubectl describe node on the “failed” node shows:
Conditions:
Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
----             ------    -----------------                 ------------------                ------              -------
MemoryPressure   Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
DiskPressure     Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
PIDPressure      Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
Ready            Unknown   Wed, 06 Oct 2021 14:23:52 -0700   Wed, 06 Oct 2021 14:24:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
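To check the Ready condition without reading the full describe output, something like the following could be used. This is a sketch under stated assumptions: the function names ready_condition and ready_condition_live are hypothetical, and the awk parsing assumes the Conditions table layout shown above.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from the original report.

# Print the Status column of the Ready row from `kubectl describe node`
# output on stdin. Only rows after the "Conditions:" heading are considered.
ready_condition() {
    awk '/^Conditions:/ { in_cond = 1; next }
         in_cond && $1 == "Ready" { print $2; exit }'
}

# Alternative that asks the API server directly via a jsonpath filter.
# $1 is the node name.
ready_condition_live() {
    kubectl get node "$1" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
```

For the node in this report, kubectl describe node somehostname1 | ready_condition would be expected to print Unknown while the kubelet is not posting status, and True once the node is Ready again.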
Expected behavior: When the interface is reconnected the node should return to Ready status.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (9 by maintainers)
Performed the same steps mentioned in https://github.com/k3s-io/k3s/issues/4239#issuecomment-956933405 using master branch commit 702fe24afe3f08f96ebc167313c20c0339a5510f