kubeadm: kubeadm join is not fault tolerant to etcd endpoint failures

What keywords did you search in kubeadm issues before filing this one?

etcd kubeadm join clusterstatus

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:35:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version): 1.13.4
  • Cloud provider or hardware configuration: Self-managed AWS
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2
  • Kernel (e.g. uname -a): 4.15.0-1032-aws
  • Others:

What happened?

kubeadm join --experimental-control-plane sporadically fails when adding a new node to the control plane cluster after a node is removed.

What you expected to happen?

For the join to succeed.

How to reproduce it (as minimally and precisely as possible)?

Create an HA stacked control plane cluster. Terminate one of the control plane nodes. Start another node, remove the failed etcd member, delete the failed node (kubectl delete node …) and run kubeadm join --experimental-control-plane on the new node.

Anything else we need to know?

This is due to a few things:

  1. The ClusterStatus in the config map still lists the node that has been terminated/removed.
  2. There is a bug in go-grpc that manifests in the etcd v3 client where if the first endpoint used in the client constructor is not responsive, the other endpoints provided to the constructor aren’t tried. https://github.com/etcd-io/etcd/pull/10489
  3. Go does not guarantee iteration order when using range on maps, such as the ClusterStatus.apiEndpoints map. If there is one "bad" endpoint in the ClusterStatus, it may or may not end up first in the endpoints list. If a "healthy" endpoint happens to come first, kubeadm does the right thing; if the "bad" endpoint comes first, the etcd client fails and kubeadm fails. (A short illustration follows this list.)
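
To make point 3 concrete, here is a minimal, hypothetical Go sketch (simplified stand-in names, not kubeadm's actual code) showing that building the endpoint list by ranging over a map gives no guarantee about which endpoint comes first:

package main

import "fmt"

func main() {
	// Hypothetical stand-in for ClusterStatus.apiEndpoints: node name -> advertise address.
	apiEndpoints := map[string]string{
		"cp-1": "10.0.0.1", // healthy
		"cp-2": "10.0.0.2", // healthy
		"cp-3": "10.0.0.3", // terminated, but still listed in ClusterStatus
	}

	// Ranging over a Go map has no defined order, so the dead member's
	// endpoint may or may not end up first on any given run.
	var endpoints []string
	for _, addr := range apiEndpoints {
		endpoints = append(endpoints, "https://"+addr+":2379")
	}
	fmt.Println(endpoints)
}

Run it a few times and the dead member's endpoint will sometimes, but not always, be the first element, which matches the intermittent failures described above.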

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 29 (14 by maintainers)

Most upvoted comments

@fabriziopandini Thanks, removed the dead etcd node manually!

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove <node_id>

Join thereafter worked as expected!

Hitting the same issue with 1.17.0, deleted a master node + etcd without running kubeadm reset on the node first.

[check-etcd] Checking that the etcd cluster is healthy
I1229 20:40:28.493126    1674 local.go:75] [etcd] Checking etcd cluster health
I1229 20:40:28.493152    1674 local.go:78] creating etcd client that connects to etcd pods
I1229 20:40:28.511562    1674 etcd.go:107] etcd endpoints read from pods: https://116.203.251.18:2379,https://116.203.251.14:2379,https://116.203.251.13:2379
I1229 20:40:28.524072    1674 etcd.go:166] etcd endpoints read from etcd: https://116.203.251.13:2379,https://116.203.251.18:2379,https://116.203.251.14:2379
I1229 20:40:28.524128    1674 etcd.go:125] update etcd endpoints: https://116.203.251.13:2379,https://116.203.251.18:2379,https://116.203.251.14:2379
I1229 20:40:48.574818    1674 etcd.go:388] Failed to get etcd status for https://116.203.251.14:2379: failed to dial endpoint https://116.203.251.14:2379 with maintenance client: context deadline exceeded

116.203.251.14 was deleted.

What is the current workaround for this issue? Editing kubeadm-config accordingly and restarting etcd + kube-api-server pods does not solve the issue.

@cgebe if I’m not wrong you have to delete the member using etcdctl

Btw: https://github.com/kubernetes/enhancements/pull/1380 is going to remove problems related to the kubeadm ClusterStatus getting stale

@mauilion

    - name: ETCDCTL_CERT
      value: /etc/kubernetes/pki/etcd/healthcheck-client.crt
    - name: ETCDCTL_KEY
      value: /etc/kubernetes/pki/etcd/healthcheck-client.key
    - name: ETCDCTL_ENDPOINTS

I’ve just sent a PR that may be the first step in deprecating the /etcd/healthcheck-client*, btw. https://github.com/kubernetes/kubernetes/pull/81385

PTAL and do tell if you object to serving an HTTP probe on localhost. I do not see a problem with that, but then again the security folks might come at us with “hey, Eve can now see the etcd metrics of Alice”. But well, if Eve gains access to Alice’s computer, she can do worse.
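
For illustration only, here is a rough sketch of what "an HTTP probe on localhost" could look like: a small sidecar bound to 127.0.0.1 that calls etcd's /health endpoint with the existing healthcheck-client certificate and re-exposes the result over plain HTTP. This is not what the linked PR implements; the port and the idea of a sidecar are assumptions.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"io"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Certificate paths taken from the manifest snippet quoted above.
	cert, err := tls.LoadX509KeyPair(
		"/etc/kubernetes/pki/etcd/healthcheck-client.crt",
		"/etc/kubernetes/pki/etcd/healthcheck-client.key",
	)
	if err != nil {
		log.Fatal(err)
	}
	caBytes, err := ioutil.ReadFile("/etc/kubernetes/pki/etcd/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caBytes)

	// HTTPS client that authenticates to etcd with the client certificate.
	etcdClient := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{Certificates: []tls.Certificate{cert}, RootCAs: caPool},
		},
	}

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Forward etcd's own /health result to the caller over plain HTTP.
		resp, err := etcdClient.Get("https://127.0.0.1:2379/health")
		if err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})

	// Plain HTTP, localhost only, as discussed above; port 2381 is an assumption.
	log.Fatal(http.ListenAndServe("127.0.0.1:2381", nil))
}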

@fabriziopandini I think this bug still exists in the etcd preflight. To reiterate: if the endpoints passed to the etcd client are stale and contain members that no longer exist or no longer run etcd, there is a chance, due to how the etcd client works, that the call to cli.Sync() will try to connect to the nonexistent endpoint and error out. That error propagates up and halts the kubeadm process. I have encountered this in some etcd client code of my own and am seeing it happen with kubeadm intermittently too. The simple fix that I have used (a rough sketch follows below) is:

  • iterate through the list of potential etcd cluster endpoints
  • create a client for just that endpoint, and check its health
  • IFF it is healthy, call cli.Sync() to ask that endpoint what the other cluster endpoints are

This prevents the edge case where we try to cli.Sync() with a node that doesn’t exist/isn’t healthy.

I will PR this unless anyone sees a problem with it, as it is safer in the long run and doesn’t preempt the other node reset work being done.
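
A minimal sketch of that approach, assuming the etcd v3 Go client (import path go.etcd.io/etcd/clientv3 here; it differs between etcd releases) and a clientv3.Config already populated with kubeadm's TLS settings. This illustrates the idea described above rather than kubeadm's actual code:

package etcdprobe

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// syncedEndpoints probes each candidate endpoint with its own single-endpoint
// client and only asks a healthy member for the cluster's endpoint list.
func syncedEndpoints(candidates []string, cfg clientv3.Config) ([]string, error) {
	var lastErr error
	for _, ep := range candidates {
		// One endpoint per client, so a dead member cannot poison the dial.
		cfg.Endpoints = []string{ep}
		cli, err := clientv3.New(cfg)
		if err != nil {
			lastErr = err
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		// Status against just this endpoint doubles as the health check.
		if _, err := cli.Status(ctx, ep); err != nil {
			cancel()
			cli.Close()
			lastErr = err
			continue
		}
		// Only a healthy member is asked (via Sync) for the other endpoints.
		err = cli.Sync(ctx)
		cancel()
		if err != nil {
			cli.Close()
			lastErr = err
			continue
		}
		eps := cli.Endpoints()
		cli.Close()
		return eps, nil
	}
	return nil, lastErr
}

Because each probe uses a client with exactly one endpoint, a dead member can only fail its own health check; it can no longer make Sync() on a shared multi-endpoint client error out.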

I was following the discussion in #1300 and now here, just wanted to throw in my support on this - everything @danbeaulieu pointed out is something we’ve noticed in our cluster management. Namely, we had a situation where we had a 3-node cluster, and for whatever reason one of the nodes died. AWS ASG brought a new one up, but it hung up on the situation described above.

Our fix (in our provisioner) was the same as what Dan described, when a new node joins, reconcile the membership with etcd and the ClusterStatus.
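
As a rough sketch of that reconciliation step (a hypothetical helper with simplified inputs; the actual provisioner code is not shown in this thread), stale etcd members can be dropped by comparing the member list against the set of live control-plane nodes:

package etcdprobe

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// removeStaleMembers drops etcd members whose names no longer correspond to a
// live control-plane node. liveNodes would come from the Kubernetes node list.
func removeStaleMembers(ctx context.Context, cli *clientv3.Client, liveNodes map[string]bool) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Members {
		if m.Name == "" {
			// An unstarted member has no name yet; leave it alone.
			continue
		}
		if !liveNodes[m.Name] {
			// Member belongs to a node that no longer exists: remove it.
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				return err
			}
		}
	}
	return nil
}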

We need to:

  1. Document how to remove a node on which kubeadm reset cannot be executed for some reason. (Hardware breakage is a valid reason)

  2. Probably introduce a kubeadm cleanup command that would act like reset, but for another node.

WDYT?

When a node goes away there is no guarantee that kubeadm reset will be called or if it is called that it will complete successfully (network partition etc).

If a node is terminated or shut down, I use etcdctl to remove that etcd member; it cannot be done with kubeadm. And if the etcd cluster has only two members and one node goes away, the cluster is not healthy and may need human intervention.

I am not sure if I understood you correctly?

@fabriziopandini in the reproduction steps kubeadm reset is not used, which is a real world scenario outside of intentional chaos engineering. When a node goes away there is no guarantee that kubeadm reset will be called or if it is called that it will complete successfully (network partition etc). So even if kubeadm reset does the right thing, it can’t be relied on in 100% of cases.

Kubeadm should be resilient to those scenarios. This issue may be a good place to track either the hacky workaround (order the map? retry the etcd client explicitly?) or the integration of the patched etcd client when it is available.
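
For completeness, a hypothetical sketch of the "order the map" workaround (simplified types, not kubeadm's code): sorting the ClusterStatus.apiEndpoints keys makes the endpoint order deterministic, which makes the failure reproducible, though it does not by itself skip a dead member; that still needs the per-endpoint health check above or the patched etcd client.

package main

import (
	"fmt"
	"sort"
)

// sortedEndpoints builds the endpoint list in a deterministic order by
// sorting the map keys first (node name -> advertise address, simplified).
func sortedEndpoints(apiEndpoints map[string]string) []string {
	names := make([]string, 0, len(apiEndpoints))
	for name := range apiEndpoints {
		names = append(names, name)
	}
	sort.Strings(names)

	endpoints := make([]string, 0, len(names))
	for _, name := range names {
		endpoints = append(endpoints, "https://"+apiEndpoints[name]+":2379")
	}
	return endpoints
}

func main() {
	fmt.Println(sortedEndpoints(map[string]string{
		"cp-1": "10.0.0.1",
		"cp-2": "10.0.0.2",
	}))
}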