etcd: etcd go client fails when querying a cluster with a down node

Describe the bug The etcd Go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.

To Reproduce I set up an etcd cluster with both http (port 2378) and https (port 2379) listeners, then used the etcd Go client library to query the cluster. I then took down the first member listed when the client was established (in my case 10.53.82.119). The http client continues to work but the https one fails.

https client:

	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	cfg := clientv3.Config{
		Endpoints:   []string{"https://10.53.82.119:2379", "https://10.53.82.150:2379", "https://10.53.82.157:2379"},
		DialTimeout: 5 * time.Second,
	}

	cert := "/home/liam/tls_vault_certs/etcd-cert.pem"
	key := "/home/liam/tls_vault_certs/etcd.key"
	ca := "/home/liam/tls_vault_certs/etcd-ca.pem"
	tls := transport.TLSInfo{
		TrustedCAFile: ca,
		CertFile:      cert,
		KeyFile:       key,
	}

	tlscfg, err := tls.ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	cfg.TLS = tlscfg

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli)
	// Any key works; this request fails when the first endpoint is down.
	if _, err := kv.Get(ctx, "sample_key"); err != nil {
		log.Fatal(err)
	}

Fails with: 2018-07-21 11:34:05.728613 I | context deadline exceeded
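Not part of the original report, but a minimal client-side mitigation sketch under the assumption that the caller knows (or probes) which endpoints are healthy: reorder the endpoint list before constructing the client so that a live endpoint comes first. `rotate` is a hypothetical helper, not an etcd client API:

```go
package main

import "fmt"

// rotate returns the endpoints shifted left by n positions, so a different
// endpoint is tried first on each connection attempt. This only avoids
// handing the dead endpoint to the client first; it does not fix the
// underlying balancer/TLS bug.
func rotate(endpoints []string, n int) []string {
	out := make([]string, len(endpoints))
	for i := range endpoints {
		out[i] = endpoints[(i+n)%len(endpoints)]
	}
	return out
}

func main() {
	eps := []string{
		"https://10.53.82.119:2379",
		"https://10.53.82.150:2379",
		"https://10.53.82.157:2379",
	}
	// If 10.53.82.119 is down, start from the next endpoint instead.
	fmt.Println(rotate(eps, 1))
	// → [https://10.53.82.150:2379 https://10.53.82.157:2379 https://10.53.82.119:2379]
}
```

The rotated slice would then be passed as `Endpoints` in the `clientv3.Config` above.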

http client:

	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	cfg := clientv3.Config{
		Endpoints:   []string{"http://10.53.82.119:2378", "http://10.53.82.150:2378", "http://10.53.82.157:2378"},
		DialTimeout: 5 * time.Second,
	}

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli)
	// Any key works; this request succeeds even with the first endpoint down.
	if _, err := kv.Get(ctx, "sample_key"); err != nil {
		log.Fatal(err)
	}

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 19
  • Comments: 21 (7 by maintainers)

Most upvoted comments

Just discussed with gRPC team, and got some good feedback. I will rework on this in the next few weeks.

We should really get an update on this for k8s - v1.15

Any updates on this? We’re still running into this with Kubernetes v1.15.0 and etcd 3.3.13

@xiang90 @jpbetz I can reproduce this. Let me see if I can fix this in etcd client side.

/cc @gyuho @jpbetz

@jsok Is the TLS config on the etcd client side or on the gRPC side? Can we switch to a fresh config when the balancer picks an endpoint different from the previous one? Can you take a look at whether we can fix this problem on the etcd client side?

FWIW, we work around this problem by placing a TCP reverse proxy on each node that connects to etcd. Each client connects to etcd via localhost:12379. Since the etcd servers' TLS certificates have a "localhost" SAN and a "127.0.0.1" IP SAN, the problem is avoided.

A possibly better workaround would be to place TLS-terminating TCP reverse proxies, i.e. proxies that terminate TLS both for client connections and for connections to the etcd servers, validating the server certificates against their public IP addresses.