kubernetes: kube-apiserver 1.13.x refuses to work when the first etcd server is not available.

How to reproduce the problem: set up a new demo cluster with kubeadm 1.13.1. Create the default configuration with kubeadm config print init-defaults, then initialize the cluster as usual with kubeadm init.

Change the --etcd-servers list in the kube-apiserver manifest to --etcd-servers=https://127.0.0.2:2379,https://127.0.0.1:2379, so that the first etcd endpoint is unavailable (“connection refused”).

The kube-apiserver is then no longer able to connect to etcd.

Last message: Unable to create storage backend: config (&{ /registry [https://127.0.0.2:2379 https://127.0.0.1:2379] /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/etcd/ca.crt true 0xc000381dd0 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.2:2379: connect: connection refused)

kube-apiserver does not start.
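
For anyone who wants to exercise the same etcd v3 client path outside the API server, here is a minimal sketch. The endpoint list and certificate paths are taken from the log message above; everything else (including the go.etcd.io import paths of the 3.4-era layout) is an assumption. Whether the read succeeds via the second endpoint tells you whether your client version fails over correctly.

// Minimal sketch: build an etcd v3 client roughly the way the API server's
// storage backend does, with the first endpoint unreachable, and issue one read.
package main

import (
    "context"
    "log"
    "time"

    "go.etcd.io/etcd/clientv3"
    "go.etcd.io/etcd/pkg/transport"
)

func main() {
    // Client certificate and CA paths as they appear in the apiserver log above.
    tlsInfo := transport.TLSInfo{
        CertFile:      "/etc/kubernetes/pki/apiserver-etcd-client.crt",
        KeyFile:       "/etc/kubernetes/pki/apiserver-etcd-client.key",
        TrustedCAFile: "/etc/kubernetes/pki/etcd/ca.crt",
    }
    tlsConfig, err := tlsInfo.ClientConfig()
    if err != nil {
        log.Fatal(err)
    }

    cli, err := clientv3.New(clientv3.Config{
        // The first endpoint refuses connections; the second one is healthy.
        Endpoints:   []string{"https://127.0.0.2:2379", "https://127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
        TLS:         tlsConfig,
    })
    if err != nil {
        log.Fatalf("client construction failed: %v", err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if _, err := cli.Get(ctx, "/registry"); err != nil {
        log.Fatalf("read failed: %v", err)
    }
    log.Println("read succeeded via the healthy endpoint")
}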

If I upgrade etcd to version 3.3.10, it reports the error: remote error: tls: bad certificate", ServerName ""

Environment:

  • Kubernetes version 1.13.1
  • kubeadm in Vagrant box

I also experience this bug in an environment with a real etcd cluster.

/kind bug

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 33
  • Comments: 68 (42 by maintainers)

Most upvoted comments

I was able to reproduce this issue with the repro steps provided by @Cytrian. I also reproduced it with a real etcd cluster.

As @JishanXing previously mentioned, the problem is caused by a bug in the etcd v3 client library (or perhaps the grpc library). The vault project is also running into this: https://github.com/hashicorp/vault/issues/4349

The problem seems to be that the etcd library uses the first node’s address as the ServerName for TLS. This means that all attempts to connect to any server other than the first will fail with a certificate validation error (i.e. cert has ${nameOfNode2} in SANs, but the client is expecting ${nameOfNode1}).
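
As a concrete illustration of that failure mode (a sketch only: the host names are placeholders, and the CA and client-certificate paths are assumed from a kubeadm layout), pinning the TLS ServerName to the first member's host and then dialing the second member reproduces the validation error unless the second member's certificate also carries the first member's name:

// Sketch: the ServerName is derived from the FIRST endpoint, but the
// connection goes to the SECOND endpoint, so the second member's certificate
// is verified against the first member's name.
package main

import (
    "crypto/tls"
    "crypto/x509"
    "io/ioutil"
    "log"
)

func main() {
    caPEM, err := ioutil.ReadFile("/etc/kubernetes/pki/etcd/ca.crt")
    if err != nil {
        log.Fatal(err)
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        log.Fatal("failed to parse CA certificate")
    }

    clientCert, err := tls.LoadX509KeyPair(
        "/etc/kubernetes/pki/apiserver-etcd-client.crt",
        "/etc/kubernetes/pki/apiserver-etcd-client.key")
    if err != nil {
        log.Fatal(err)
    }

    cfg := &tls.Config{
        RootCAs:      pool,
        Certificates: []tls.Certificate{clientCert},
        ServerName:   "etcd-node-1.example.com", // name of the FIRST member
    }

    // Dialing the second member with the first member's ServerName fails with
    // an x509 error like "certificate is valid for etcd-node-2..., not
    // etcd-node-1..." unless node 2's certificate also lists node 1 in its SANs.
    conn, err := tls.Dial("tcp", "etcd-node-2.example.com:2379", cfg)
    if err != nil {
        log.Fatalf("handshake failed: %v", err)
    }
    defer conn.Close()
    log.Println("handshake succeeded; node 2's certificate covers node 1's name")
}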

An important thing to highlight is that when the first etcd server goes down, it also takes the Kubernetes API servers down, because they fail to connect to the remaining etcd servers.

With that said, this all depends on what your etcd server certificates look like:

  • If you follow the kubeadm instructions to stand up a 3 node etcd cluster, you get a set of certificates that include the first node’s name and IP in the SANs (because all certs are generated on the first etcd node). Thus, you should not run into this issue.
  • If you have used another process to generate certificates for etcd, and the certs do not include the first node’s name and IP in the SANs, you will most likely run into this issue when the first etcd node goes down (a quick way to inspect a certificate’s SANs is sketched below).
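
If you are not sure which of the two cases applies, a quick check is to parse a member's serving certificate and look at its SANs. A sketch, assuming the kubeadm path /etc/kubernetes/pki/etcd/server.crt (adjust for your setup):

// Print the SANs of an etcd serving certificate to see whether the first
// member's name/IP is included.
package main

import (
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "io/ioutil"
    "log"
)

func main() {
    data, err := ioutil.ReadFile("/etc/kubernetes/pki/etcd/server.crt")
    if err != nil {
        log.Fatal(err)
    }
    block, _ := pem.Decode(data)
    if block == nil {
        log.Fatal("no PEM block found")
    }
    cert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("DNS SANs:", cert.DNSNames)
    fmt.Println("IP SANs: ", cert.IPAddresses)
}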

To reproduce the issue with a real etcd cluster:

  1. Create a 3 node etcd cluster with TLS enabled. Each certificate should only contain the name/IP of the node that will be serving it.
  2. Start an API server that points to the etcd cluster.
  3. Stop the first etcd node.
  4. The API server crashes and fails to come back up (a sketch for checking the remaining members directly follows these steps).
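
To confirm that the remaining members are healthy while the API server is down (i.e. that the problem is on the client side), each member can be queried with its own single-endpoint client, which sidesteps the shared-ServerName problem. A sketch; the endpoints and PKI paths are placeholders:

// Query each etcd member with a single-endpoint client and print its status,
// to show the cluster itself still has quorum while the API server cannot
// reach it.
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "go.etcd.io/etcd/clientv3"
    "go.etcd.io/etcd/pkg/transport"
)

func main() {
    endpoints := []string{
        "https://etcd1.example.com:2379", // the member that was stopped
        "https://etcd2.example.com:2379",
        "https://etcd3.example.com:2379",
    }

    tlsInfo := transport.TLSInfo{
        CertFile:      "/etc/kubernetes/pki/apiserver-etcd-client.crt",
        KeyFile:       "/etc/kubernetes/pki/apiserver-etcd-client.key",
        TrustedCAFile: "/etc/kubernetes/pki/etcd/ca.crt",
    }
    tlsConfig, err := tlsInfo.ClientConfig()
    if err != nil {
        log.Fatal(err)
    }

    for _, ep := range endpoints {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{ep}, // one endpoint per client
            DialTimeout: 3 * time.Second,
            TLS:         tlsConfig,
        })
        if err != nil {
            fmt.Printf("%s: client error: %v\n", ep, err)
            continue
        }
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        status, err := cli.Status(ctx, ep)
        cancel()
        cli.Close()
        if err != nil {
            fmt.Printf("%s: unreachable: %v\n", ep, err)
            continue
        }
        fmt.Printf("%s: healthy, version %s, leader %x\n", ep, status.Version, status.Leader)
    }
}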

Versions:

  • kubeadm version: v1.13.2
  • kubernetes api server version: v1.13.2
  • etcd image: k8s.gcr.io/etcd:3.2.24

API server crash log: https://gist.github.com/alexbrand/ba86f506e4278ed2ada4504ab44b525b

I was unable to reproduce this issue with API server v1.12.5 (n.b. this was a somewhat unscientific test: I only updated the image field of the API server static pod produced by kubeadm v1.13.2).

We have 3 masters and 3 etcd servers; a workaround is to change the order of the etcd servers on each master. master0:

--etcd-servers=etcd0,etcd1,etcd2

master1:

--etcd-servers=etcd1,etcd0,etcd2

master2:

--etcd-servers=etcd2,etcd0,etcd1

@timothysc I just came back from a trip. I will start working on this this week and post updates here.

@dims @jpbetz https://github.com/etcd-io/etcd/releases/tag/v3.3.14-beta.0 has been released with all the fixes. Please try. Once tests look good in the next few days, I will release v3.3.14.

Update: https://github.com/etcd-io/etcd/releases/tag/v3.3.15 has been released (superseding the earlier v3.3.14 tag).

@liggitt so that leaves people on 1.13-1.15 without proper HA? I think this issue deserves to be fixed in the three supported releases of Kubernetes. The hotfix mentioned here looks simple enough to be added, but you are saying the proper fix will require much more. So this is all somewhat obscure and confusing to the community, IMO.

Not that I am complaining, don’t get me wrong; it’s just that I think everyone would welcome some clarity on this issue. Maybe document it somewhere and provide some workarounds for the people who are still on v1.13-v1.15? Because right now, bring down the first etcd member and, oops, the API is not working and the cluster is not working.

@timothysc I will look into this.

@dims I just built a custom 1.14 image from the release-1.14 branch, patched that credentials.go file, and it’s so much better now when I bring the first etcd node down. Now I am confused: if the fix is so simple, why does it have to wait until 1.16?
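
For context, the gist of the change discussed here is to derive the TLS ServerName from the endpoint actually being dialed instead of reusing the first endpoint’s host for every connection. The following is a conceptual sketch of that per-endpoint derivation only, not the actual grpc-go or etcd patch; all names in it are hypothetical.

// Conceptual illustration only: map each endpoint URL to the ServerName that
// certificate verification should use for that endpoint.
package main

import (
    "fmt"
    "log"
    "net"
    "net/url"
)

// serverNameFor returns the host portion of an endpoint URL.
func serverNameFor(endpoint string) (string, error) {
    u, err := url.Parse(endpoint)
    if err != nil {
        return "", err
    }
    host, _, err := net.SplitHostPort(u.Host)
    if err != nil {
        // No port in the URL; use the host as-is.
        return u.Host, nil
    }
    return host, nil
}

func main() {
    for _, ep := range []string{
        "https://etcd1.example.com:2379",
        "https://etcd2.example.com:2379",
        "https://172.17.8.202:2379",
    } {
        name, err := serverNameFor(ep)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s -> ServerName %q\n", ep, name)
    }
}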

Update: https://github.com/grpc/grpc-go/releases/tag/v1.23.0 is out. We are bumping gRPC in the etcd master branch (https://github.com/etcd-io/etcd/pull/11029), in addition to the Go runtime upgrade (https://groups.google.com/forum/#!topic/golang-announce/65QixT3tcmg). Once tests look good, we will start working on 3.3 backports.

@igcherkaev We are planning to backport the fix to etcd 3.3 after the etcd 3.4 release. Then Kubernetes can pick up the latest etcd 3.3.

Tried with Kubernetes 1.15.3 and 1.16.2, but it’s not working with either. This is not fixed even for IP addresses:

W1107 12:48:06.316691       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://172.17.8.202:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.0.2.15, 127.0.0.1, ::1, 172.17.8.202, not 172.17.8.201". Reconnecting...
W1107 12:48:06.328186       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://172.17.8.203:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for 10.0.2.15, 127.0.0.1, ::1, 172.17.8.203, not 172.17.8.201". Reconnecting...
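
To double-check what each member is actually serving (a diagnostic sketch: the addresses come from the log above, the client-certificate paths are assumptions), one can complete a handshake with verification disabled and print the SANs of the presented certificate:

// Connect to each etcd member, skip verification only so the handshake
// completes, and print the SANs of the certificate the member presents.
// Do not use InsecureSkipVerify in real clients.
package main

import (
    "crypto/tls"
    "fmt"
    "log"
)

func main() {
    clientCert, err := tls.LoadX509KeyPair(
        "/etc/kubernetes/pki/apiserver-etcd-client.crt",
        "/etc/kubernetes/pki/apiserver-etcd-client.key")
    if err != nil {
        log.Fatal(err)
    }

    for _, addr := range []string{"172.17.8.202:2379", "172.17.8.203:2379"} {
        conn, err := tls.Dial("tcp", addr, &tls.Config{
            Certificates:       []tls.Certificate{clientCert},
            InsecureSkipVerify: true, // inspection only
        })
        if err != nil {
            log.Printf("%s: %v", addr, err)
            continue
        }
        peers := conn.ConnectionState().PeerCertificates
        if len(peers) > 0 {
            fmt.Printf("%s presents DNS SANs %v, IP SANs %v\n", addr, peers[0].DNSNames, peers[0].IPAddresses)
        }
        conn.Close()
    }
}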

The issue is still there with Kubernetes 1.14.6 and etcd 3.3.15. Are any changes to Kubernetes libraries or code needed to tackle this issue?

August 13, 2019 is the earliest day the fix is going to land in 3.3.

Yes

@gyuho Can we please wrap up the backport (to etcd 3.3) within the next week or two? Please see the timeline for 1.16 (https://github.com/kubernetes/sig-release/tree/master/releases/release-1.16#timeline). We need sufficient soak time for this change in k/k.

@liggitt @igcherkaev I will work on the documentation.

Is there a hotfix for v1.15?

Just discussed with the gRPC team and @jingyih. We will rework this in the next few weeks.

I think the issue is more about how the gRPC balancer does failover with credentials.

I’ve shared a workaround to fix this issue in upstream gRPC https://github.com/grpc/grpc-go/pull/2650.

I am waiting for their feedback. /cc @xiang90 @jpbetz

No pod manifest is involved here, just a group of etcd servers and a kube-apiserver. The issue appeared when we rebooted the first etcd node.