rancher: etcd node changes can cause unhealthy cluster (especially in case of leader re-election, slow etcd or large etcd size)
**What kind of request is this (question/bug/enhancement/feature request):** Bug
**Steps to reproduce (least amount of steps as possible):**
Rancher Version: v2.3.2
Kubernetes Version: v1.13.5
Docker Version: 17.03.2-ce
AWS Worker Node: CentOS Linux release 7.6.1810, m5.2xlarge, kernel 4.4.184-1.el7.elrepo.x86_64
The cluster was launched on Amazon EC2 via the Rancher GUI. It has 3 separate nodes that handle etcd & the control plane; worker nodes are separate and don't run etcd or the control plane.
1) Try to add an AWS worker node. The node gets stuck during provisioning. I see the following commands run in /var/log/secure on the node:
============================================
grep COMMAND /var/log/secure | grep <provisioninguser>
Dec 11 11:34:22 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/hostname blah-blah
Dec 11 11:34:22 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/tee /etc/hostname
Dec 11 11:34:25 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/tee -a /etc/hosts
Dec 11 11:34:29 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/yum install -y curl
Dec 11 11:34:58 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/yum -y update -x docker-*
Then the provisioning user logs out after a couple of minutes without doing anything:
Dec 11 11:37:27 <hostname> sshd[pid]: pam_unix(sshd:session): session closed for user <provisioninguser>
**Result:**
The AWS worker node is stuck in provisioning, and when I try to remove it, it gets stuck in the "Removing" state with the message:
Provisioning with <provisioninguser>...; waiting on node-controller
**Other details that may be helpful:**
**Environment information**
Rancher Version: v2.3.2, Kubernetes Version: v1.13.5, Docker Version: 17.03.2-ce
- Rancher version (`rancher/rancher`/`rancher/server` image tag or shown bottom left in the UI): rancher/rancher:v2.3.2, rancher/rancher-agent:v2.3.2, rancher/hyperkube:v1.13.5-rancher1 (Rancher runs in a separate "local" k8s cluster)
- Installation option (single install/HA): HA
**Cluster information**
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Infrastructure Provider Amazon EC2
- Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS m5.2xlarge
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T18:55:03Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Docker version (use `docker version`):
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:21:36 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:21:36 2017
OS/Arch: linux/amd64
Experimental: false
gz#9105
Based on https://github.com/rancher/rancher/issues/24547#issuecomment-645008685, the following was investigated:
As it wasn't consistently reproducible, I started by testing removal of one and two nodes at a time. Sometimes this would succeed, sometimes it would fail. Based on this outcome, I tried checking the etcd leader and making sure that the node being deleted was the etcd leader. This caused more delay in recovering the cluster, because deleting the leader triggers a re-election. Because there is no wait/health check on member delete, there is not enough time to re-elect a leader, and either the next member delete (in case of deleting multiple nodes) or the cluster health check that we only do on reconcile comes too soon and doesn't give the cluster enough time to settle.
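For context, a minimal sketch of how the etcd leader can be identified on an RKE/Rancher etcd node (the container name `etcd` and the preconfigured etcdctl environment are RKE defaults; on older etcdctl versions without `--cluster`, pass the endpoints explicitly):

```
# Run on any etcd node; the RKE etcd container ships etcdctl with
# endpoints and certificates preconfigured via environment variables.
# The "IS LEADER" column shows which member currently leads.
docker exec etcd etcdctl endpoint status --cluster -w table
```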
The fix is to add a health check of the cluster on member delete.
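A rough sketch of that idea as shell commands (the member ID placeholder, retry count, and sleep interval are illustrative; this is not the actual Rancher implementation):

```
# Remove the member, then poll cluster health so the remaining members
# get time to re-elect a leader before any further membership change.
docker exec etcd etcdctl member remove <member-id>
for i in $(seq 1 30); do
  if docker exec etcd etcdctl endpoint health --cluster; then
    echo "etcd cluster is healthy again"
    break
  fi
  echo "etcd has not settled yet (attempt $i), retrying..."
  sleep 5
done
```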
There should also be a warning presented when deleting a node in a 2-node etcd cluster, because etcd documents that this can be unsafe (see https://github.com/etcd-io/etcd/tree/master/raft#implementation-notes): with 2 members, quorum is 2, so a single member failing during the membership change leaves the cluster unable to make progress.
To reproduce (either RKE CLI or Rancher node driver):
To make it easier to reproduce, raise the election-timeout on etcd so it takes longer to elect a new leader:
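One way to do this, assuming an RKE-provisioned cluster (the 5000 ms value is illustrative; etcd's default election timeout is 1000 ms):

```
# Raise the timeout via services.etcd.extra_args in cluster.yml, e.g.
#   election-timeout: "5000"
# then run `rke up`. Afterwards, verify the flag the etcd container
# actually received (container name "etcd" is the RKE default):
docker inspect etcd --format '{{json .Config.Cmd}}' | tr ',' '\n' \
  | grep -E 'election-timeout|heartbeat-interval'
```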