etcd: etcd cluster fails to start when using DNS SRV discovery with non-TLS
I am running etcd (version 3.4.3) on Fedora CoreOS (version 30) using Podman.
When running etcd without TLS and with SRV discovery, startup fails because etcd cannot find any _etcd-server-ssl
entries. This should not be an error: those entries do not exist because TLS is not being used.
2019-10-31 09:53:30.575647 E | embed: couldn't resolve during SRV discovery (error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.libvirt.labs on 172.16.10.1:53: no such host)
2019-10-31 09:53:30.575892 C | etcdmain: error setting up initial cluster: error querying DNS SRV records for _etcd-server-ssl lookup _etcd-server-ssl._tcp.libvirt.labs on 172.16.10.1:53: no such host
It also fails on 3.4.2, 3.4.1 and 3.4.0. However, it works properly on 3.3.17 (see table below), and I don’t see any change in the 3.4 changelog that requires TLS when SRV discovery is enabled. Is this the correct behaviour in 3.4?
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd1.libvirt.labs:2379 | ceb796e1dfaeb27e | 3.3.17 | 20 kB | true | false | 11 | 9 | 0 | |
| http://etcd2.libvirt.labs:2379 | b8dfd5ef2d30984a | 3.3.17 | 20 kB | false | false | 11 | 9 | 0 | |
| http://etcd3.libvirt.labs:2379 | dde9feb56ac9a7ad | 3.3.17 | 20 kB | false | false | 11 | 9 | 0 | |
+--------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
The first member of the cluster is started with:
ETCD_UUID="5d9701ad-6c02-4f64-b614-1e4561c29181" # $(uuidgen)
ETCD_VERSION="v3.4.3"
ETCD_NODE_NAME="$(hostname -s)"
ETCD_NODE_CLIENT_ADVERTISE_URL="http://$(hostname | cut -d' ' -f1):2379"
ETCD_NODE_SERVER_ADVERTISE_URL="http://$(hostname | cut -d' ' -f1):2380"
ETCD_NODE_CLIENT_LISTEN_URL="http://$(hostname -I | cut -d' ' -f1):2379"
ETCD_NODE_SERVER_LISTEN_URL="http://$(hostname -I | cut -d' ' -f1):2380"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_DNS_SRV_DOMAIN="$(dnsdomainname)"
mkdir -p ${ETCD_DATA_DIR}
podman run \
--name etcd \
--volume ${ETCD_DATA_DIR}:/etcd-data:z \
--net=host \
quay.io/coreos/etcd:${ETCD_VERSION} \
/usr/local/bin/etcd \
--name ${ETCD_NODE_NAME} \
--data-dir /etcd-data \
--initial-cluster-state new \
--initial-cluster-token ${ETCD_UUID} \
--discovery-srv ${ETCD_DNS_SRV_DOMAIN} \
--advertise-client-urls ${ETCD_NODE_CLIENT_ADVERTISE_URL} \
--initial-advertise-peer-urls ${ETCD_NODE_SERVER_ADVERTISE_URL} \
--listen-client-urls ${ETCD_NODE_CLIENT_LISTEN_URL} \
--listen-peer-urls ${ETCD_NODE_SERVER_LISTEN_URL}
The DNS SRV entries for the etcd cluster are:
$ dig +noall +answer SRV _etcd-server._tcp.libvirt.labs _etcd-client._tcp.libvirt.labs
_etcd-server._tcp.libvirt.labs. 0 IN SRV 0 0 2380 etcd1.libvirt.labs.
_etcd-server._tcp.libvirt.labs. 0 IN SRV 0 0 2380 etcd3.libvirt.labs.
_etcd-server._tcp.libvirt.labs. 0 IN SRV 0 0 2380 etcd2.libvirt.labs.
_etcd-client._tcp.libvirt.labs. 0 IN SRV 0 0 2379 etcd3.libvirt.labs.
_etcd-client._tcp.libvirt.labs. 0 IN SRV 0 0 2379 etcd2.libvirt.labs.
_etcd-client._tcp.libvirt.labs. 0 IN SRV 0 0 2379 etcd1.libvirt.labs.
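To see which of the TLS and non-TLS SRV names actually resolve, the four names can be checked side by side. This is a minimal sketch, assuming dig is installed; lookup_srv and check_etcd_srv are hypothetical helper names and are not part of etcd.

```shell
#!/bin/sh
# Hypothetical resolver wrapper; assumes dig is installed.
lookup_srv() {
  dig +short SRV "$1"
}

# Print "present" or "missing" for each SRV name used by etcd DNS discovery.
check_etcd_srv() {
  domain=$1
  for service in etcd-server-ssl etcd-server etcd-client-ssl etcd-client; do
    name="_${service}._tcp.${domain}"
    # An empty answer means the record does not exist (or could not be resolved).
    if [ -n "$(lookup_srv "$name")" ]; then
      echo "${name} present"
    else
      echo "${name} missing"
    fi
  done
}
```

It would be called as check_etcd_srv "$(dnsdomainname)". With only the records listed above, the _etcd-server-ssl name should be reported as missing, which matches the lookup that fails in the log.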
The DNS A entries for the etcd cluster are:
$ dig +noall +answer etcd1.libvirt.labs etcd2.libvirt.labs etcd3.libvirt.labs
etcd1.libvirt.labs. 0 IN A 172.16.10.49
etcd2.libvirt.labs. 0 IN A 172.16.10.188
etcd3.libvirt.labs. 0 IN A 172.16.10.36
The etcd version is:
etcd Version: 3.4.3
Git SHA: 3cf2f69b5
Go Version: go1.12.12
Go OS/Arch: linux/amd64
About this issue
- State: closed
- Created 5 years ago
- Reactions: 8
- Comments: 15 (4 by maintainers)
Commits related to this issue
- Fix #11321 Fix issues identified in https://github.com/etcd-io/etcd/issues/11321#issuecomment-612172439 — committed to brandond/etcd by brandond 4 years ago
According to the code, GetDNSClusterNames is supposed to try both. However, that function returns the error from the etcd-server-ssl lookup if it fails, ignoring the fact that the etcd-server lookup was successful and clusterStrs contains valid addresses. Unfortunately, PeerURLsMapAndToken sees the error from the failed tls lookup and returns early.
This seems to have been broken here: https://github.com/etcd-io/etcd/commit/b664b9176c78ea15d5fc026354d87017dfc83c20 - the tests don’t catch it because they hardcode the SRV result set, without actually testing the record that they come from (_etcd-server-ssl._tcp.example.com for the https scheme or _etcd-server._tcp.example.com for http).
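The behaviour described above (try both records, and fail only when neither yields addresses) can be illustrated outside of etcd. The following is a shell sketch of that logic, not etcd's Go implementation. It assumes dig is installed, and lookup_srv and discover_peer_urls are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical resolver wrapper; assumes dig is installed.
lookup_srv() {
  dig +short SRV "$1"
}

# Print one peer URL per SRV target, trying the TLS record first and then the
# plain one. Returns non-zero only when neither lookup yields any record.
discover_peer_urls() {
  domain=$1
  status=1
  for pair in etcd-server-ssl:https etcd-server:http; do
    service=${pair%%:*}
    scheme=${pair##*:}
    # A failed or empty lookup for one scheme is skipped, not treated as fatal.
    records=$(lookup_srv "_${service}._tcp.${domain}" 2>/dev/null) || continue
    [ -n "$records" ] || continue
    # dig +short prints SRV answers as: priority weight port target.
    printf '%s\n' "$records" |
      awk -v scheme="$scheme" '{ sub(/\.$/, "", $4); print scheme "://" $4 ":" $3 }'
    status=0
  done
  return "$status"
}
```

It would be called as discover_peer_urls libvirt.labs. The point of the sketch is the control flow: an error from the _etcd-server-ssl lookup is remembered only as "nothing found yet", so addresses from a successful _etcd-server lookup are still returned.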