go: net: retry DNS lookups before failure?
I’ve noticed that our net DNS tests are often flaky when run on the builders.
For example:
https://build.golang.org/log/ce5a87135d1a5ed4f17bd998ace2e0060b9ad597
https://build.golang.org/log/b3e762fc83d463acba21987ff558c8018b33c7cb
https://build.golang.org/log/250fc567590d125f1c8fd27740105eb7288ab16c
--- FAIL: TestLookupDotsWithRemoteSource (5.05s)
lookup_test.go:566: LookupSRV(xmpp-server, tcp, google.com): lookup _xmpp-server._tcp.google.com on 8.8.8.8:53: no such host (mode=go)
--- FAIL: TestLookupDotsWithRemoteSource (5.46s)
lookup_test.go:540: LookupMX(google.com): lookup google.com on 8.8.8.8:53: no such host (mode=cgo)
FAIL
FAIL net 7.838s
--- FAIL: TestLookupGmailNS (5.01s)
lookup_test.go:142: lookup gmail.com. on 8.8.8.8:53: dial udp 8.8.8.8:53: i/o timeout
FAIL
FAIL net 7.336s
etc.
Notice that they all fail after about 5 seconds (our default DNS timeout).
Did a UDP request get lost?
Did a UDP response get lost?
Does NAT make some builders worse?
Should we make builders re-try all DNS tests N times?
But these tests are also flaky, to a much lesser degree, on my desktop on wired Ethernet. Over 500 runs, I still see occasional failures.
Maybe we should make our net package’s DNS code automatically resend the UDP request after half the timeout? (i.e. after 2.5 seconds by default)
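To make the "resend after half the timeout" idea concrete, here is a minimal sketch. It is not the net package's actual resolver code: `exchangeWithResend`, `lossyEcho`, and `demoResend` are hypothetical names invented for this illustration, and the payload is an opaque byte string rather than a real DNS message. The resend carries the same bytes (and so the same query ID), so a late reply to the first send would still be accepted.

```go
package main

import (
	"errors"
	"net"
	"os"
	"time"
)

// exchangeWithResend writes query to conn and waits for a reply. If nothing
// arrives within half of timeout, it sends the same query once more and waits
// until the full timeout has elapsed. It returns the reply and the number of
// sends. Hypothetical helper for illustration; not part of package net.
func exchangeWithResend(conn net.Conn, query []byte, timeout time.Duration) (reply []byte, sends int, err error) {
	start := time.Now()
	buf := make([]byte, 512) // classic maximum size of a DNS message over UDP
	// First read deadline at timeout/2, final read deadline at timeout.
	for _, d := range []time.Duration{timeout / 2, timeout} {
		if _, err = conn.Write(query); err != nil {
			return nil, sends, err
		}
		sends++
		if err = conn.SetReadDeadline(start.Add(d)); err != nil {
			return nil, sends, err
		}
		var n int
		n, err = conn.Read(buf)
		if err == nil {
			return buf[:n], sends, nil
		}
		if !errors.Is(err, os.ErrDeadlineExceeded) {
			return nil, sends, err // a real error, not a lost packet
		}
	}
	return nil, sends, err
}

// lossyEcho starts a UDP server on localhost that silently drops the first
// `drop` datagrams and echoes every later one. It only simulates packet loss.
func lossyEcho(drop int) (addr string, stop func(), err error) {
	pc, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		buf := make([]byte, 512)
		for seen := 0; ; seen++ {
			n, from, err := pc.ReadFrom(buf)
			if err != nil {
				return // listener closed
			}
			if seen >= drop {
				pc.WriteTo(buf[:n], from)
			}
		}
	}()
	return pc.LocalAddr().String(), func() { pc.Close() }, nil
}

// demoResend runs one exchange against a server that loses the first packet,
// using an assumed 400ms timeout so the resend happens after 200ms.
func demoResend() (sends int, err error) {
	addr, stop, err := lossyEcho(1)
	if err != nil {
		return 0, err
	}
	defer stop()
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	_, sends, err = exchangeWithResend(conn, []byte("query"), 400*time.Millisecond)
	return sends, err
}
```

With the first datagram dropped, `demoResend` should succeed on the second send instead of failing at the full timeout. Whether a single resend at the halfway point is the right policy for the real resolver is exactly the open question above.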
About this issue
- State: closed
- Created 8 years ago
- Comments: 20 (16 by maintainers)
I’m okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I’m hesitant to do things like change default timeouts / retry logic just to appease flaky tests.
A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn’t even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.
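A rough sketch of such a canned-response server, under the assumptions stated in the comment: the type `cannedDNS` and its methods are hypothetical names for this illustration, and the table entries would have to be raw wire-format bytes captured from real queries and responses, which are not shown here.

```go
package main

import "net"

// cannedDNS maps the bytes of a query after its 2-byte ID to the bytes of
// the response after its 2-byte ID. Keys and values are raw packet bytes;
// nothing is decoded. Hypothetical type for illustration only.
type cannedDNS map[string][]byte

// answer builds a reply for one raw query packet: the 16-bit ID is copied
// from the query and the remainder is found by exact byte-string lookup.
// It reports false if the query is too short or not in the table.
func (c cannedDNS) answer(query []byte) ([]byte, bool) {
	if len(query) < 2 {
		return nil, false
	}
	rest, ok := c[string(query[2:])]
	if !ok {
		return nil, false
	}
	reply := make([]byte, 0, 2+len(rest))
	reply = append(reply, query[:2]...) // copy the query ID
	return append(reply, rest...), true
}

// serve answers datagrams on pc until pc is closed. Unknown queries are
// dropped, which the client will see as a timeout.
func (c cannedDNS) serve(pc net.PacketConn) {
	buf := make([]byte, 512)
	for {
		n, from, err := pc.ReadFrom(buf)
		if err != nil {
			return // listener closed
		}
		if reply, ok := c.answer(buf[:n]); ok {
			pc.WriteTo(reply, from)
		}
	}
}
```

A test could start this with `net.ListenPacket("udp", "127.0.0.1:0")` and point a `net.Resolver` at it through the resolver's `Dial` field. One caveat of exact byte matching: the table has to match the query bytes the client actually sends, so it may need regenerating if the resolver changes how it encodes queries.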