etcd: etcd cluster fails to start when using DNS SRV discovery with non-TLS

I am running etcd (version 3.4.3) on Fedora CoreOS (version 30) using Podman.

When running etcd with SRV discovery and no TLS, startup fails because etcd cannot find _etcd-server-ssl entries. That lookup should not be fatal: the entries are absent precisely because TLS is not being used.

2019-10-31 09:53:30.575647 E | embed: couldn't resolve during SRV discovery (error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.libvirt.labs on 172.16.10.1:53: no such host)
2019-10-31 09:53:30.575892 C | etcdmain: error setting up initial cluster: error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.libvirt.labs on 172.16.10.1:53: no such host

It also fails on 3.4.2, 3.4.1 and 3.4.0. However, it works properly on 3.3.17 (see the table below), and I don’t see any change in the 3.4 changelog that requires TLS when SRV discovery is enabled. Is this the intended behaviour in 3.4?

+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|            ENDPOINT            |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd1.libvirt.labs:2379 | ceb796e1dfaeb27e |  3.3.17 |   20 kB |      true |      false |        11 |          9 |                  0 |        |
| http://etcd2.libvirt.labs:2379 | b8dfd5ef2d30984a |  3.3.17 |   20 kB |     false |      false |        11 |          9 |                  0 |        |
| http://etcd3.libvirt.labs:2379 | dde9feb56ac9a7ad |  3.3.17 |   20 kB |     false |      false |        11 |          9 |                  0 |        |
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The first member of the cluster is started with:

ETCD_UUID="5d9701ad-6c02-4f64-b614-1e4561c29181" # $(uuidgen)
ETCD_VERSION="v3.4.3"
ETCD_NODE_NAME="$(hostname -s)"
ETCD_NODE_CLIENT_ADVERTISE_URL="http://$(hostname | cut -d' ' -f1):2379"
ETCD_NODE_SERVER_ADVERTISE_URL="http://$(hostname | cut -d' ' -f1):2380"
ETCD_NODE_CLIENT_LISTEN_URL="http://$(hostname -I | cut -d' ' -f1):2379"
ETCD_NODE_SERVER_LISTEN_URL="http://$(hostname -I | cut -d' ' -f1):2380"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_DNS_SRV_DOMAIN="$(dnsdomainname)"

mkdir -p ${ETCD_DATA_DIR}

podman run \
  --name etcd \
  --volume ${ETCD_DATA_DIR}:/etcd-data:z \
  --net=host \
  quay.io/coreos/etcd:${ETCD_VERSION} \
    /usr/local/bin/etcd \
      --name ${ETCD_NODE_NAME} \
      --data-dir /etcd-data \
      --initial-cluster-state new \
      --initial-cluster-token ${ETCD_UUID} \
      --discovery-srv ${ETCD_DNS_SRV_DOMAIN} \
      --advertise-client-urls ${ETCD_NODE_CLIENT_ADVERTISE_URL} \
      --initial-advertise-peer-urls ${ETCD_NODE_SERVER_ADVERTISE_URL} \
      --listen-client-urls ${ETCD_NODE_CLIENT_LISTEN_URL} \
      --listen-peer-urls ${ETCD_NODE_SERVER_LISTEN_URL}

The DNS SRV entries for the etcd cluster are:

$ dig +noall +answer SRV _etcd-server._tcp.libvirt.labs _etcd-client._tcp.libvirt.labs
_etcd-server._tcp.libvirt.labs.	0 IN	SRV	0 0 2380 etcd1.libvirt.labs.
_etcd-server._tcp.libvirt.labs.	0 IN	SRV	0 0 2380 etcd3.libvirt.labs.
_etcd-server._tcp.libvirt.labs.	0 IN	SRV	0 0 2380 etcd2.libvirt.labs.
_etcd-client._tcp.libvirt.labs.	0 IN	SRV	0 0 2379 etcd3.libvirt.labs.
_etcd-client._tcp.libvirt.labs.	0 IN	SRV	0 0 2379 etcd2.libvirt.labs.
_etcd-client._tcp.libvirt.labs.	0 IN	SRV	0 0 2379 etcd1.libvirt.labs.
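
To make it explicit which SRV names exist for a domain, a small pre-flight check along these lines may help. It is only a sketch: check_etcd_srv and srv_query are hypothetical helper names, not part of etcd, and it assumes dig is installed. The four service names are the ones etcd's DNS discovery documentation describes.

```shell
# Query one SRV name; prints matching records, or nothing if none exist.
# Kept as a separate function so another resolver (or a stub) can be swapped in.
srv_query() {
  dig +short SRV "$1"
}

# Report which etcd SRV service names resolve under the given domain.
# check_etcd_srv is a hypothetical helper, not an etcd command.
check_etcd_srv() {
  local domain="$1" service records
  for service in _etcd-server-ssl _etcd-server _etcd-client-ssl _etcd-client; do
    records="$(srv_query "${service}._tcp.${domain}")"
    if [ -n "${records}" ]; then
      echo "${service}: present"
    else
      echo "${service}: missing"
    fi
  done
}
```

Running check_etcd_srv "$(dnsdomainname)" against the records above should report the two -ssl services as missing and the two plain services as present, which is the non-TLS setup described in this issue.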

The DNS A entries for the etcd cluster are:

$ dig +noall +answer etcd1.libvirt.labs etcd2.libvirt.labs etcd3.libvirt.labs
etcd1.libvirt.labs.	0	IN	A	172.16.10.49
etcd2.libvirt.labs.	0	IN	A	172.16.10.188
etcd3.libvirt.labs.	0	IN	A	172.16.10.36

The etcd version is:

etcd Version: 3.4.3
Git SHA: 3cf2f69b5
Go Version: go1.12.12
Go OS/Arch: linux/amd64

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 8
  • Comments: 15 (4 by maintainers)

Most upvoted comments

According to the code, GetDNSClusterNames is supposed to try both services. However, if the _etcd-server-ssl lookup fails, the function returns that error even though the _etcd-server lookup succeeded and clusterStrs contains valid addresses. PeerURLsMapAndToken then sees the error from the failed TLS lookup and returns early.
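
The "try both" semantics described above can be sketched in shell. This is an illustration of the expected behaviour only, not etcd's Go implementation: discover_peer_urls and srv_query are hypothetical names, dig is assumed to be available, and the URL format (scheme://target:port) is an assumption made for the example. The point is that a failed or empty lookup for one service is skipped, and an error is raised only when neither service yields a record.

```shell
# Query one SRV name; prints "priority weight port target" lines, as dig +short does.
srv_query() {
  dig +short SRV "$1"
}

# Print one peer URL per SRV record found under the domain.
# Returns non-zero only if neither _etcd-server-ssl nor _etcd-server has records.
discover_peer_urls() {
  local domain="$1" found=1 pair scheme service records
  for pair in "https:_etcd-server-ssl" "http:_etcd-server"; do
    scheme="${pair%%:*}"
    service="${pair#*:}"
    # A failed or empty lookup for one service is not fatal on its own.
    records="$(srv_query "${service}._tcp.${domain}" 2>/dev/null)" || continue
    [ -n "${records}" ] || continue
    found=0
    # Field 3 is the port, field 4 the target; strip the target's trailing dot.
    printf '%s\n' "${records}" |
      awk -v scheme="${scheme}" '{ sub(/\.$/, "", $4); print scheme "://" $4 ":" $3 }'
  done
  if [ "${found}" -ne 0 ]; then
    echo "no etcd peer SRV records found under ${domain}" >&2
  fi
  return "${found}"
}
```

With only _etcd-server records published, as in this issue, discover_peer_urls libvirt.labs would print http:// peer URLs and succeed. The behaviour reported for 3.4.x corresponds to treating the first skipped lookup as a fatal error instead.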

This seems to have been broken here: https://github.com/etcd-io/etcd/commit/b664b9176c78ea15d5fc026354d87017dfc83c20. The tests don’t catch it because they hardcode the SRV result set without checking which record the results come from (_etcd-server-ssl._tcp.example.com for the https scheme or _etcd-server._tcp.example.com for http).