coredns: CoreDNS resolution failure for external hostnames with "overflow unpacking uint32"
First of all, forgive me if this is not the right place to post. I’m using CoreDNS (in a k8s/Rancher installation). Rancher uses its own fork of CoreDNS, but looking at their repository it seems to be the same code as CoreDNS itself.
Here’s my original issue: rancher/rke/issues/1662, but I think I posted it in the wrong place.
Their rke tool installs CoreDNS-1.3.1 with this configuration:
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . '/etc/resolv.conf'
    cache 30
    loop
    reload
    loadbalance
}
Log start:
.:53
2019-09-24T13:54:37.187Z [INFO] CoreDNS-1.3.1
2019-09-24T13:54:37.187Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
I’ve installed a Rancher cluster using RKE with 3 nodes on KVM. Until now everything worked well, but starting today I have an intermittent issue with DNS name resolution in my pods. It happens randomly and only for some hosts.
When it happens, DNS resolution of external names in pods stops working and I get some nasty errors in the CoreDNS pod. Here are some examples:
2019-09-24T12:46:25.111Z [INFO] plugin/reload: Running configuration MD5 = 45cd9f91917cc54711e243e0d08537a7
2019-09-24T12:47:27.474Z [ERROR] plugin/errors: 2 security.ubuntu.com. A: dns: overflow unpacking uint32
2019-09-24T12:47:32.475Z [ERROR] plugin/errors: 2 security.ubuntu.com. A: dns: overflow unpacking uint32
2019-09-24T12:47:37.476Z [ERROR] plugin/errors: 2 security.ubuntu.com. A: dns: overflow unpacking uint32
2019-09-24T13:12:39.537Z [ERROR] plugin/errors: 2 registry.npmjs.org. A: dns: overflow unpacking uint32
2019-09-24T13:12:39.549Z [ERROR] plugin/errors: 2 registry.npmjs.org. AAAA: dns: overflow unpacking uint16
2019-09-24T13:12:44.539Z [ERROR] plugin/errors: 2 registry.npmjs.org. AAAA: dns: overflow unpacking uint16
2019-09-24T13:12:44.543Z [ERROR] plugin/errors: 2 registry.npmjs.org. A: dns: overflow unpacking uint32
I’ve enabled the CoreDNS log directive in the ConfigMap to get verbose logging, and I get this:
2019-09-24T13:13:56.246Z [INFO] 10.42.0.230:56169 - 21605 "A IN registry.npmjs.org. udp 36 false 512" SERVFAIL qr,rd 36 5.003233501s
2019-09-24T13:13:56.246Z [ERROR] plugin/errors: 0 registry.npmjs.org. A: dns: overflow unpacking uint32
2019-09-24T13:13:56.251Z [INFO] 10.42.0.230:56169 - 4205 "AAAA IN registry.npmjs.org. udp 36 false 512" SERVFAIL qr,rd 36 5.008031338s
2019-09-24T13:13:56.251Z [ERROR] plugin/errors: 0 registry.npmjs.org. AAAA: dns: overflow unpacking uint16
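For anyone reading these query log lines later, here is a rough Python sketch that splits one of them into named fields. The field names (`proto`, `size`, `do`, `bufsize`, `rcode`, `rsize`, `duration`) are my assumption about what the log directive's default format means; they are not part of the original post. Read that way, `udp 36 false 512` would mean a 36-byte UDP request with the DO bit unset and a 512-byte buffer size.

```python
import re

# Field names below are assumptions about the log directive's default format.
LOG_RE = re.compile(
    r'(?P<remote>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>true|false) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<rflags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)


def parse_query_log(line):
    """Return a dict of fields for one query log line, or None if it does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("port", "id", "size", "bufsize", "rsize"):
        fields[key] = int(fields[key])
    fields["do"] = fields["do"] == "true"
    fields["duration"] = float(fields["duration"])  # seconds
    return fields


line = ('2019-09-24T13:13:56.246Z [INFO] 10.42.0.230:56169 - 21605 '
        '"A IN registry.npmjs.org. udp 36 false 512" SERVFAIL qr,rd 36 5.003233501s')
fields = parse_query_log(line)
print(fields["name"], fields["proto"], fields["bufsize"], fields["rcode"])
```

The `[ERROR]` lines use a different layout and are deliberately not matched; `parse_query_log` returns `None` for them.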
Finally, I also changed this directive in the Corefile:
forward . "/etc/resolv.conf"
# replaced with
forward . 1.1.1.1
# also tried with
forward . 8.8.8.8
Nothing changes.
I can find nothing relevant by googling for the message “overflow unpacking uint32”, just some code fragments where it is triggered.
My CoreDNS deployment uses rancher/coredns-coredns:1.3.1 as its image.
What could it be?
About this issue
- State: closed
- Created 5 years ago
- Comments: 22 (7 by maintainers)
Commenting in case somebody else needs it. I’m also using a Mikrotik router.
I noticed this while troubleshooting Nextcloud News. There’s a scheduled run that makes a request to a bunch of URLs for RSS updates, and some but not all were failing. Looking at the CoreDNS logs gave many similar entries:
On my Mikrotik, Max UDP Packet Size defaults to 4096. I’ve reduced mine to 512. No errors in logs, and feeds have updated successfully.
Had a similar issue, but with another message:
Increasing the max UDP packet size on my Mikrotik from the default value of 4096 to 8192 seems to have solved this.
We recently started bumping into this problem, in particular when querying for AWS S3 hostnames in the format <bucketName>.s3.us-west-1.amazonaws.com. I assume the reason is that we get back an answer with the CNAME and also a lot of A records for the CNAME, overflowing the buffer. Glad to hear a patch will be released sometime!
In the meantime (in case anyone else is arriving at this page from web searches as I did), we worked around it by configuring our k8s pods to prefer TCP DNS resolution. You just need to add the use-vc option to your dnsConfig.options like this:
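The commenter's own snippet did not survive on this page. A minimal sketch of what such a pod spec could look like, assuming a plain Pod manifest (the pod name, container name and image are hypothetical placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # hypothetical name
spec:
  containers:
    - name: app              # hypothetical container
      image: busybox
  dnsConfig:
    options:
      - name: use-vc         # written into the pod's resolv.conf as "options use-vc"
```

Whether the option takes effect depends on the resolver library inside the container image, so it is worth checking the pod's /etc/resolv.conf and testing a lookup after applying it.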
Commentary above seems to suggest that this is a problem with an upstream DNS server creating an illegal overflowed response, and upon receiving these malformed responses, CoreDNS justifiably rejects them.
Would it be nice if CoreDNS could work around these upstream bugs? Yes. A workaround patch that aims to fix this sort of issue has already been merged and will be included in the next release (no date for that release has been planned yet).
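To make the two error strings from the logs more concrete, here is a small Python sketch of the kind of bounds check that would produce them. It assumes the messages mean "the buffer ended before a 2-byte or 4-byte field could be read"; the function names are illustrative and this is not CoreDNS's actual code. In a resource record, TYPE, CLASS and RDLENGTH are 16-bit fields and TTL is a 32-bit field, so a response cut off in the middle of a record header can fail at either width.

```python
import struct


class UnpackError(Exception):
    pass


def unpack_uint16(msg, off):
    # Refuse to read past the end of the received buffer.
    if off + 2 > len(msg):
        raise UnpackError("overflow unpacking uint16")
    return struct.unpack_from("!H", msg, off)[0], off + 2


def unpack_uint32(msg, off):
    if off + 4 > len(msg):
        raise UnpackError("overflow unpacking uint32")
    return struct.unpack_from("!I", msg, off)[0], off + 4


def unpack_rr_fixed_fields(msg, off):
    """Read TYPE, CLASS, TTL and RDLENGTH, which follow a record's owner name."""
    rtype, off = unpack_uint16(msg, off)
    rclass, off = unpack_uint16(msg, off)
    ttl, off = unpack_uint32(msg, off)
    rdlength, off = unpack_uint16(msg, off)
    return (rtype, rclass, ttl, rdlength), off


# Fixed fields of an A record: TYPE=1, CLASS=1 (IN), TTL=30, RDLENGTH=4.
fields = struct.pack("!HHIH", 1, 1, 30, 4)


def error_for(buf):
    """Return the error message raised while unpacking buf, or None on success."""
    try:
        unpack_rr_fixed_fields(buf, 0)
    except UnpackError as exc:
        return str(exc)
    return None


print(error_for(fields))      # complete header
print(error_for(fields[:6]))  # cut off inside the TTL
print(error_for(fields[:3]))  # cut off inside the CLASS
```

Under that assumption, the mix of uint32 and uint16 errors in the logs would just reflect where in a record each response happened to be cut off.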
Reducing the Max UDP packet size on my router solved the problem.
Yes, it should be Go code in Docker. I was able to reproduce the issue outside Kubernetes and CoreDNS.
I get a similar issue by using a plain docker pull hello-world command on the same machine where CoreDNS is running and also on another machine in the same network. Here’s the output:
If I do a dig/curl/nslookup for index.docker.io everything works, so it must be something inside the Go code that causes this issue.
This is very unlikely: I changed the upstream servers to both 8.8.8.8 (Google) and 1.1.1.1 (Cloudflare) and the issue persists. It would also be strange for Google’s DNS to send a malformed response.
I also tried another approach: I installed dnsmasq locally using 8.8.8.8 as the upstream server, then I updated the local /etc/resolv.conf with nameserver 127.0.0.1 and everything worked. I’m able to pull images from registries and Docker works.
I’m starting to think that the issue could be in the LAN itself: the main gateway (from Mikrotik), which runs RouterOS, could have some issues with EDNS. I found these resources (even if they are pretty old) on the web, so I’ll point my investigation in that direction.
https://www.dns-oarc.net/oarc/services/replysizetest https://forum.mikrotik.com/viewtopic.php?t=46227
My attempts clearly show that CoreDNS is not involved in this issue, so I think this could be closed. I’ll try to use force_tcp as a temporary workaround until I find the actual issue with my setup.
Thanks for your support anyway.
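For reference, a minimal sketch of how the force_tcp workaround could look in the Corefile, assuming the forward line from the configuration at the top of this issue. force_tcp is an option of the forward plugin that makes CoreDNS use TCP towards the upstream even when the client query arrived over UDP:

```
forward . /etc/resolv.conf {
    force_tcp
}
```

This only changes the transport between CoreDNS and the upstream resolver; pods still query CoreDNS over UDP as before.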
P.S. In order to continue my investigations, since I’m not an expert in the DNS protocol, could you please tell me how I can distinguish EDNS requests and responses in PCAP dumps?
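One way to approach the P.S. question: a DNS message uses EDNS when its additional section carries an OPT pseudo-record (record type 41), and the CLASS field of that record holds the sender's advertised UDP payload size. The Python sketch below checks a raw DNS payload for it. It assumes you have already extracted the DNS payload bytes from the capture (that step is not shown), and the helper names are illustrative.

```python
import struct

OPT_TYPE = 41  # OPT pseudo-record type used by EDNS(0)


def skip_name(msg, off):
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = msg[off]
        if length == 0:
            return off + 1            # root label terminates the name
        if length & 0xC0 == 0xC0:
            return off + 2            # 2-byte compression pointer ends the name
        off += 1 + length             # ordinary label: length byte plus data


def find_edns(msg):
    """Return the advertised UDP payload size if an OPT record is present, else None."""
    _id, _flags, qd, an, ns, ar = struct.unpack_from("!HHHHHH", msg, 0)
    off = 12
    for _ in range(qd):
        off = skip_name(msg, off) + 4             # skip QTYPE and QCLASS
    for i in range(an + ns + ar):
        off = skip_name(msg, off)
        rtype, rclass, _ttl, rdlen = struct.unpack_from("!HHIH", msg, off)
        off += 10 + rdlen
        if i >= an + ns and rtype == OPT_TYPE:    # only look in the additional section
            return rclass                         # for OPT, CLASS is the UDP payload size
    return None


# Hand-built example queries for "example.com. A", with and without an OPT record.
question = b"\x07example\x03com\x00" + struct.pack("!HH", 1, 1)
opt_record = b"\x00" + struct.pack("!HHIH", OPT_TYPE, 4096, 0, 0)
query_with_edns = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 1) + question + opt_record
query_plain = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0) + question

print(find_edns(query_with_edns), find_edns(query_plain))
```

A message without an OPT record falls back to the classic 512-byte UDP limit, so comparing the advertised size in the request with the size of the response that actually arrives is a reasonable first check.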