etcd: etcd go client fails when querying a cluster with a down node

Describe the bug The etcd Go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.

To Reproduce I set up an etcd cluster with both http (port 2378) and https (port 2379) listeners, then used the etcd Go client library to query the cluster. I then took down the first member listed when the client was established (in my case 10.53.82.119). The http client continues to work but the https one fails.

https client:

	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	cfg := clientv3.Config{
		Endpoints:   []string{"https://10.53.82.119:2379", "https://10.53.82.150:2379", "https://10.53.82.157:2379"},
		DialTimeout: 5 * time.Second,
	}

	cert := "/home/liam/tls_vault_certs/etcd-cert.pem"
	key := "/home/liam/tls_vault_certs/etcd.key"
	ca := "/home/liam/tls_vault_certs/etcd-ca.pem"
	tls := transport.TLSInfo{
		TrustedCAFile: ca,
		CertFile:      cert,
		KeyFile:       key,
	}

	tlscfg, err := tls.ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	cfg.TLS = tlscfg

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli)
	// Any key works; this request fails when the first endpoint is down.
	if _, err := kv.Get(ctx, "sample_key"); err != nil {
		log.Fatal(err)
	}

Fails with: 2018-07-21 11:34:05.728613 I | context deadline exceeded
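Not part of the original report, but a minimal client-side mitigation sketch under the assumption that the caller knows (or probes) which endpoints are healthy: reorder the endpoint list before constructing the client so that a live endpoint comes first. `rotate` is a hypothetical helper, not an etcd client API:

```go
package main

import "fmt"

// rotate returns the endpoints shifted left by n positions, so a different
// endpoint is tried first on each connection attempt. This only avoids
// handing the dead endpoint to the client first; it does not fix the
// underlying balancer/TLS bug.
func rotate(endpoints []string, n int) []string {
	out := make([]string, len(endpoints))
	for i := range endpoints {
		out[i] = endpoints[(i+n)%len(endpoints)]
	}
	return out
}

func main() {
	eps := []string{
		"https://10.53.82.119:2379",
		"https://10.53.82.150:2379",
		"https://10.53.82.157:2379",
	}
	// If 10.53.82.119 is down, start from the next endpoint instead.
	fmt.Println(rotate(eps, 1))
	// → [https://10.53.82.150:2379 https://10.53.82.157:2379 https://10.53.82.119:2379]
}
```

The rotated slice would then be passed as `Endpoints` in the `clientv3.Config` above.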

http client:

	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	cfg := clientv3.Config{
		Endpoints:   []string{"http://10.53.82.119:2378", "http://10.53.82.150:2378", "http://10.53.82.157:2378"},
		DialTimeout: 5 * time.Second,
	}

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli)
	// Any key works; this request succeeds even with the first endpoint down.
	if _, err := kv.Get(ctx, "sample_key"); err != nil {
		log.Fatal(err)
	}

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 19
  • Comments: 21 (7 by maintainers)

Most upvoted comments

Just discussed with gRPC team, and got some good feedback. I will rework on this in the next few weeks.

We should really get an update on this for k8s - v1.15

Any updates on this? We’re still running into this with Kubernetes v1.15.0 and etcd 3.3.13

@xiang90 @jpbetz I can reproduce this. Let me see if I can fix this in etcd client side.

/cc @gyuho @jpbetz

@jsok Is the TLS config on the etcd client side or on the gRPC side? Can we switch to a fresh config when the balancer picks an endpoint different from the previous one? Can you take a look at whether we can fix this problem on the etcd client side?

FWIW, we work around this problem by placing a TCP reverse proxy on each node that connects to etcd. Each client connects to etcd via localhost:12379. Since the etcd servers' TLS certificates have a "localhost" SAN and a "127.0.0.1" IP SAN, the problem is avoided.

A possibly better workaround would be to place TLS-terminating TCP reverse proxies, i.e. proxies that terminate TLS both for client connections and for connections to the etcd servers, validating the server certificates against their public IP addresses.