rancher: etcd node changes can cause unhealthy cluster (especially in case of leader re-election, slow etcd or large etcd size)

**What kind of request is this (question/bug/enhancement/feature request):** Bug

**Steps to reproduce (least amount of steps as possible):**

Rancher Version: v2.3.2
Kubernetes Version: v1.13.5
Docker Version: 17.03.2
AWS Worker Node: CentOS Linux release 7.6.1810, m5.2xlarge, kernel 4.4.184-1.el7.elrepo.x86_64

The cluster is an Amazon EC2 cluster launched via the Rancher GUI. It has 3 separate nodes which run etcd & controlplane; worker nodes are separate and do not run etcd & controlplane.

1) Try to add an AWS worker node. The node gets stuck during provisioning. I see the following commands run in /var/log/secure on the node:
============================================
grep COMMAND /var/log/secure  | grep <provisioninguser>
Dec 11 11:34:22 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/hostname blah-blah
Dec 11 11:34:22 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/tee /etc/hostname
Dec 11 11:34:25 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/tee -a /etc/hosts
Dec 11 11:34:29 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/yum install -y curl
Dec 11 11:34:58 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/yum -y update -x docker-*

Then the provisioning user logs out after a couple of minutes without doing anything:
Dec 11 11:37:27 <hostname> sshd[pid]: pam_unix(sshd:session): session closed for user <provisioninguser>

Result: The AWS worker node is stuck in provisioning, and when we try to remove it, it gets stuck in the "Removing" state with the message "Provisioning with <provisioninguser>...; waiting on node-controller".

Other details that may be helpful:

Environment information

Rancher Version: v2.3.2
Kubernetes Version: v1.13.5
Docker Version: 17.03.2

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): rancher/rancher-agent:v2.3.2, rancher/hyperkube:v1.13.5-rancher1, rancher/rancher:v2.3.2 (this is run in a separate "local" k8s cluster)
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Infrastructure Provider Amazon EC2
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS m5.2xlarge
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T18:55:03Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:21:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:21:36 2017
 OS/Arch:      linux/amd64
 Experimental: false


gz#9105

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Based on https://github.com/rancher/rancher/issues/24547#issuecomment-645008685, the following was investigated:

As it wasn't consistently reproduced, I started by testing removing one and two nodes at a time. Sometimes this would succeed, sometimes it would fail. Based on this outcome I tried checking the etcd leader and making sure that, when deleting a node, it was the etcd leader. This caused more delay in recovering the cluster because a re-election is triggered. Because there is no wait/health check on member delete, there is not enough time to re-elect a leader, and either the next member delete (in case of deleting multiple nodes) or the cluster health check that we only do on reconcile comes too soon and doesn't give the cluster enough time to settle.
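For reference, a quick way to check the current leader and cluster health is etcdctl from inside the RKE-managed etcd container (this assumes the container is named etcd and that the ETCDCTL_* environment variables for endpoints and certificates are already set inside it; otherwise pass --endpoints, --cacert, --cert and --key explicitly):

# List the members of the cluster and their IDs
docker exec etcd etcdctl member list

# Per-endpoint status; the IS LEADER column shows whether the local member is the leader
docker exec etcd etcdctl endpoint status --write-out table

# Basic health check of the configured endpoints
docker exec etcd etcdctl endpoint health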

The fix is to add a health check on the cluster after each member delete.
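As an illustration only (not the actual RKE change), a minimal sketch of such a wait using the etcd clientv3 Go API could look like the following; the endpoint list, TLS setup and retry/timeout values are simplified assumptions:

package healthcheck

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // etcd v3.3/v3.4 era import path
)

// waitForLeader polls the remaining endpoints until one of them reports a
// leader, or the timeout expires.
func waitForLeader(cli *clientv3.Client, endpoints []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		for _, ep := range endpoints {
			ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
			status, err := cli.Status(ctx, ep)
			cancel()
			// Status reports Leader == 0 while no leader is elected.
			if err == nil && status.Leader != 0 {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("no etcd leader elected within %s", timeout)
}

// removeMemberAndWait deletes a member and then blocks until the shrunken
// cluster has a leader again, instead of moving straight on to the next
// member delete or reconcile.
func removeMemberAndWait(cli *clientv3.Client, memberID uint64, remaining []string) error {
	if _, err := cli.MemberRemove(context.Background(), memberID); err != nil {
		return err
	}
	return waitForLeader(cli, remaining, 2*time.Minute)
}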

There should also be a warning presented when deleting a node in a 2-node cluster, because the etcd documentation states that this can be unsafe (see https://github.com/etcd-io/etcd/tree/master/raft#implementation-notes):

This approach introduces a problem when removing a member from a two-member cluster: If one of the members dies before the other one receives the commit of the confchange entry, then the member cannot be removed any more since the cluster cannot make progress. For this reason it is highly recommended to use three or more nodes in every cluster.

To reproduce (either RKE CLI or Rancher node driver):

To make it easier to reproduce, raise the election-timeout on etcd (the value is in milliseconds) so it takes longer to elect a new leader:

services:
  etcd:
    extra_args:
      election-timeout: "50000"
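With this setting (50000 ms = 50 seconds), re-election after removing the leader's node takes much longer, which makes the failure easier to hit. One way to confirm the extra argument actually reached etcd on each etcd node (again assuming the RKE-managed container is named etcd) is:

# The extra argument should show up in the container's command line
docker inspect etcd | grep election-timeout

Then use the etcdctl commands shown above to find the node whose member is the current leader, and delete that node from the cluster.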