rancher: etcd node changes can cause unhealthy cluster (especially in case of leader re-election, slow etcd or large etcd size)

**What kind of request is this (question/bug/enhancement/feature request):** Bug

**Steps to reproduce (least amount of steps as possible):**

Rancher Version: v2.3.2
Kubernetes Version: v1.13.5
Docker Version: 17.03.2
AWS Worker Node: CentOS Linux release 7.6.1810, m5.2xlarge, kernel 4.4.184-1.el7.elrepo.x86_64

The cluster is an Amazon EC2 cluster launched via the Rancher GUI. It has 3 separate nodes which run etcd & controlplane; worker nodes are separate and do not run etcd & controlplane.

1) Try to add an AWS worker node. The node gets stuck during provisioning. I see the following commands run in /var/log/secure on the node:
============================================
grep COMMAND /var/log/secure  | grep <provisioninguser>
Dec 11 11:34:22 ip-x-x-x-x sudo: <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/hostname blah-blah
Dec 11 11:34:22 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/tee /etc/hostname
Dec 11 11:34:25 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/tee -a /etc/hosts
Dec 11 11:34:29 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/yum install -y curl
Dec 11 11:34:58 ip-x-x-x-x sudo:  <provisioninguser> : TTY=pts/0 ; PWD=/home/<provisioninguser> ; USER=root ; COMMAND=/bin/yum -y update -x docker-*

Then the provisioning user logs out after a couple of minutes without doing anything:
Dec 11 11:37:27 <hostname> sshd[pid]: pam_unix(sshd:session): session closed for user <provisioninguser>

Result: The AWS worker node is stuck in provisioning, and when we try to remove it, it gets stuck in the "Removing" state with the message "Provisioning with <provisioninguser>...; waiting on node-controller".

Other details that may be helpful:

Environment information

Rancher Version: v2.3.2
Kubernetes Version: v1.13.5
Docker Version: 17.03.2

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): rancher/rancher-agent:v2.3.2, rancher/hyperkube:v1.13.5-rancher1, rancher/rancher:v2.3.2 (this is run in a separate "local" k8s cluster)
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Infrastructure Provider Amazon EC2
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS m5.2xlarge
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T18:55:03Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:21:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:21:36 2017
 OS/Arch:      linux/amd64
 Experimental: false


gz#9105

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Based on https://github.com/rancher/rancher/issues/24547#issuecomment-645008685, the following was investigated:

As it wasn't consistently reproduced, I started by testing removing one and two nodes at a time. Sometimes this would succeed, sometimes it would fail. Based on this outcome I tried checking the etcd leader and making sure that, when deleting a node, it was the etcd leader. This caused more delay in recovering the cluster because a re-election is triggered. Because there is no wait/health check on member delete, there is not enough time to re-elect a leader, and either the next member delete (in case of deleting multiple nodes) or the cluster health check that we only do on reconcile comes too soon and doesn't give the cluster enough time to settle.
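For reference, a quick way to check the current leader and cluster health is etcdctl from inside the RKE-managed etcd container (this assumes the container is named etcd and that the ETCDCTL_* environment variables for endpoints and certificates are already set inside it; otherwise pass --endpoints, --cacert, --cert and --key explicitly):

# List the members of the cluster and their IDs
docker exec etcd etcdctl member list

# Per-endpoint status; the IS LEADER column shows whether the local member is the leader
docker exec etcd etcdctl endpoint status --write-out table

# Basic health check of the configured endpoints
docker exec etcd etcdctl endpoint health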

The fix is to add a health check on the cluster after each member delete.
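As an illustration only (not the actual RKE change), a minimal sketch of such a wait using the etcd clientv3 Go API could look like the following; the endpoint list, TLS setup and retry/timeout values are simplified assumptions:

package healthcheck

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3" // etcd v3.3/v3.4 era import path
)

// waitForLeader polls the remaining endpoints until one of them reports a
// leader, or the timeout expires.
func waitForLeader(cli *clientv3.Client, endpoints []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		for _, ep := range endpoints {
			ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
			status, err := cli.Status(ctx, ep)
			cancel()
			// Status reports Leader == 0 while no leader is elected.
			if err == nil && status.Leader != 0 {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("no etcd leader elected within %s", timeout)
}

// removeMemberAndWait deletes a member and then blocks until the shrunken
// cluster has a leader again, instead of moving straight on to the next
// member delete or reconcile.
func removeMemberAndWait(cli *clientv3.Client, memberID uint64, remaining []string) error {
	if _, err := cli.MemberRemove(context.Background(), memberID); err != nil {
		return err
	}
	return waitForLeader(cli, remaining, 2*time.Minute)
}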

There should also be a warning presented when deleting a node in a 2-node cluster, because the etcd documentation states that this can be unsafe (see https://github.com/etcd-io/etcd/tree/master/raft#implementation-notes):

This approach introduces a problem when removing a member from a two-member cluster: If one of the members dies before the other one receives the commit of the confchange entry, then the member cannot be removed any more since the cluster cannot make progress. For this reason it is highly recommended to use three or more nodes in every cluster.

To reproduce (either RKE CLI or Rancher node driver):

To make it easier to reproduce, raise the election-timeout on etcd (the value is in milliseconds) so it takes longer to elect a new leader:

services:
  etcd:
    extra_args:
      election-timeout: "50000"
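With this setting (50000 ms = 50 seconds), re-election after removing the leader's node takes much longer, which makes the failure easier to hit. One way to confirm the extra argument actually reached etcd on each etcd node (again assuming the RKE-managed container is named etcd) is:

# The extra argument should show up in the container's command line
docker inspect etcd | grep election-timeout

Then use the etcdctl commands shown above to find the node whose member is the current leader, and delete that node from the cluster.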