coredns: Frequent DNS failures on ARM64 - DialWithDialer Timed Out

We’re running CoreDNS on ARM64, built with Go 1.10. Queries are forwarded to Cloudflare’s encrypted DNS (1.1.1.1), which works fine most of the time. However, queries frequently time out. The timeouts always occur in batches, and the problem resolves itself after 5-10 seconds. It can be several minutes before it happens again. Possibly related: we’re seeing extended CPU spikes to 40%+.

` [/usr/local/go/src/net/dial.go:net.(*Resolver).resolveAddrList 208] err:%!(EXTRA <nil>)

[/usr/local/go/src/net/dial.go:net.(*Dialer).DialContext 390] [DEBUG] err:%!(EXTRA <nil>)

2018/11/02 19:45:38 [ERROR] 2 init.itunes.apple.com. A: EOF

[/usr/local/go/src/net/dial.go:net.(*Resolver).resolveAddrList 208] err:%!(EXTRA <nil>)

[/usr/local/go/src/net/dial.go:net.(*Dialer).DialContext 390] [DEBUG] err:%!(EXTRA <nil>)

[/usr/local/go/src/net/dial.go:net.(*Resolver).resolveAddrList 208] err:%!(EXTRA <nil>)

[/usr/local/go/src/net/dial.go:net.(*Dialer).DialContext 390] [DEBUG] err:%!(EXTRA <nil>)

2018/11/02 19:45:38 [ERROR] 2 cl4.apple.com. AAAA: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 gs-loc.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 mesu.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 gateway.icloud.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 wu-calculator.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 cl5.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 cl2.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 init-p01md.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 smp-device-content.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 bag.itunes.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 smp-device-content.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 gspe35-ssl.ls.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 gs-loc.apple.com. A: tls: DialWithDialer timed out

2018/11/02 19:45:38 [ERROR] 2 cl2.apple.com. A: tls: DialWithDialer timed out `

Our Corefile:

    .:53 {
        log stdout
        errors
        cache 300
        hosts /etc/winston/hosts {
            fallthrough
        }
        forward . tls://1.1.1.1 {
            tls_servername cloudflare-dns.com
            health_check 5s
        }
        fallback SERVFAIL . 8.8.8.8:53
    }

We’re not sure how to go about diagnosing this. Any suggestions welcome.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 57 (28 by maintainers)

Most upvoted comments

[ Quoting notifications@github.com in “Re: [coredns/coredns] Frequent DNS …” ]

We’re running CoreDNS on a network device (similar to a router or pihole) attached to a home network.

We’re using the “hosts” plugin (see Corefile above) and our hosts file has 80,000+ lines in it.

By stampede, I mean that the first lookup takes a little while. Meanwhile, potentially dozens of other queries are lining up behind it, waiting for it to fall through the hosts plugin. By the 10th lookup or so, we’ve crossed the 5 second timeout and all of the remaining queries are cancelled. The end result is that things work normally for most sites, but when we hit one that calls many different domains, we get a 5 second delay followed by dozens of DNS timeouts.

So this might be because the hosts plugin was never designed for a hosts file with 80k lines. But regardless of that, I think this may happen when we re-read the file and take a write lock.