Flatcar: systemd-resolved randomly failing on 2765.2.1
Description
After upgrade to the 2765.2.1 and enabling systemd-resolved we have random resolution errors. We use consul
to run node checks. I found that after recent upgrade many network related checks starts to flapping. After i analyzed these shell script i found that failure is related to DNS and binded to systemd-resolved
. I was able to see error in the logs after setting resolvectl log-level debug
. Here is a relevant fragment (see {"error":"io.systemd.System","parameters":{"errno":22}
)
Error happens once in a few minutes but not on every request what causing flapping.
Upstream ticket: https://github.com/systemd/systemd/issues/19118
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: n/a: New incoming connection.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: n/a: Connections of user 499: 0 (of 1024 max)
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: setting state idle-server
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: New incoming message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"collector.sysdigcloud.com"}}
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: changing state idle-server → processing-method
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: idn2_lookup_u8: collector.sysdigcloud.com → collector.sysdigcloud.com
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Looking up RR for collector.sysdigcloud.com IN A.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Looking up RR for collector.sysdigcloud.com IN AAAA.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Removing cache entry for collector-ss-nlb-prod.sysdigcloud.com IN A (expired 0s ago)
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Removing cache entry for collector.sysdigcloud.com IN CNAME (expired 0s ago)
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Cache miss for collector.sysdigcloud.com IN A
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Transaction 37728 for <collector.sysdigcloud.com IN A> scope dns on eth0/*.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Using feature level UDP+EDNS0 for transaction 37728.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Using DNS server 10.213.10.2 for transaction 37728.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Sending query packet with id 37728 of size 54.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Cache miss for collector.sysdigcloud.com IN AAAA
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Transaction 7383 for <collector.sysdigcloud.com IN AAAA> scope dns on eth0/*.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Using feature level UDP+EDNS0 for transaction 7383.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Using DNS server 10.213.10.2 for transaction 7383.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Sending query packet with id 7383 of size 54.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: changing state processing-method → pending-method
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Processing incoming packet on transaction 7383 (rcode=SUCCESS).
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Not caching zero TTL cache entry: collector.sysdigcloud.com IN CNAME
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Transaction 7383 for <collector.sysdigcloud.com IN AAAA> on scope dns on eth0/* now complete with <success> from network (unsigned).
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Processing incoming packet on transaction 37728 (rcode=SUCCESS).
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Added positive unauthenticated cache entry for collector.sysdigcloud.com IN CNAME 60s on eth0/INET/10.213.10.2
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Added positive unauthenticated cache entry for collector-ss-nlb-prod.sysdigcloud.com IN A 60s on eth0/INET/10.213.10.2
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Added positive unauthenticated cache entry for collector-ss-nlb-prod.sysdigcloud.com IN A 60s on eth0/INET/10.213.10.2
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Added positive unauthenticated cache entry for collector-ss-nlb-prod.sysdigcloud.com IN A 60s on eth0/INET/10.213.10.2
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Transaction 37728 for <collector.sysdigcloud.com IN A> on scope dns on eth0/* now complete with <success> from network (unsigned).
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Freeing transaction 7383.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: Sending message: {"error":"io.systemd.System","parameters":{"errno":22}}
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: changing state pending-method → idle-server
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: Freeing transaction 37728.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: Got POLLHUP from socket.
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: changing state idle-server → pending-disconnect
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: changing state pending-disconnect → processing-disconnect
Mar 24 07:29:15 worker-blue-11-114 systemd-resolved[1033560]: varlink-14: varlink: changing state processing-disconnect → disconnected
Impact
At the moment at least our monitoring is very unstable, most likely other services are impacted. I am planning to completely disable systemd-resolved as a temporary workaround.
Environment and steps to reproduce
It is possible to reproduce with original AMI, see https://github.com/kinvolk/Flatcar/issues/374#issuecomment-805843889 for the steps.
Expected behavior
No DNS errors.
Additional information
Failing scripts using curl and openssl to connect to the services. This is an error from the OpenSSL output:
139881895659328:error:2008F002:BIO routines:BIO_lookup_ex:system lib:crypto/bio/b_addr.c:726:Name or service not known
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 29 (8 by maintainers)
Yes, makes sense, and we document how to enable it for those that need split DNS.
We’ve also noticed some DNS issues with Flatcar 2765.2.1 and 2765.2.2. In our case, we disabled the DNSStubListener as we are using node local dns. We are also going to try to switch back nsswitch.conf back to the old one used in 2605.12.0 to see if that will solve it.
Thanks for the work, I’ll be back next week and have a look