coredns: CoreDNS resolution failure for external hostnames with "overflow unpacking uint32"

First of all, forgive me if this is not the right place to post. I’m using coredns (in a k8s/rancher installation). Rancher uses its own fork of coredns, but looking at their repository it seems to be the same code as coredns itself.

Here’s my original issue: rancher/rke/issues/1662, but I think I posted it in the wrong place.

Their rke tool installs CoreDNS-1.3.1 with this configuration:

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      upstream
      fallthrough in-addr.arpa ip6.arpa
      ttl 30
    }
    prometheus :9153
    forward . "/etc/resolv.conf"
    cache 30
    loop
    reload
    loadbalance
}

Log start:

.:53
2019-09-24T13:54:37.187Z [INFO] CoreDNS-1.3.1
2019-09-24T13:54:37.187Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c

I’ve installed a Rancher cluster using RKE with 3 nodes on KVM. Until now everything worked well, but starting today I have an intermittent issue with DNS name resolution in my pods. It happens only for some hosts, and randomly.

When it happens, DNS resolution for external names in pods stops working and I get some nasty errors in the coredns pod. Here are some examples:

2019-09-24T12:46:25.111Z [INFO] plugin/reload: Running configuration MD5 = 45cd9f91917cc54711e243e0d08537a7
2019-09-24T12:47:27.474Z [ERROR] plugin/errors: 2 security.ubuntu.com. A: dns: overflow unpacking uint32
2019-09-24T12:47:32.475Z [ERROR] plugin/errors: 2 security.ubuntu.com. A: dns: overflow unpacking uint32
2019-09-24T12:47:37.476Z [ERROR] plugin/errors: 2 security.ubuntu.com. A: dns: overflow unpacking uint32
2019-09-24T13:12:39.537Z [ERROR] plugin/errors: 2 registry.npmjs.org. A: dns: overflow unpacking uint32
2019-09-24T13:12:39.549Z [ERROR] plugin/errors: 2 registry.npmjs.org. AAAA: dns: overflow unpacking uint16
2019-09-24T13:12:44.539Z [ERROR] plugin/errors: 2 registry.npmjs.org. AAAA: dns: overflow unpacking uint16
2019-09-24T13:12:44.543Z [ERROR] plugin/errors: 2 registry.npmjs.org. A: dns: overflow unpacking uint32   
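
A quick way to check resolution from inside the cluster is a throwaway pod; image and target name below are just examples (busybox:1.28 because nslookup in later busybox images is known to be unreliable):

kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup security.ubuntu.com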

I’ve enabled the coredns log directive in the ConfigMap to get verbose logging, and I get this:

2019-09-24T13:13:56.246Z [INFO] 10.42.0.230:56169 - 21605 "A IN registry.npmjs.org. udp 36 false 512" SERVFAIL qr,rd 36 5.003233501s
2019-09-24T13:13:56.246Z [ERROR] plugin/errors: 0 registry.npmjs.org. A: dns: overflow unpacking uint32
2019-09-24T13:13:56.251Z [INFO] 10.42.0.230:56169 - 4205 "AAAA IN registry.npmjs.org. udp 36 false 512" SERVFAIL qr,rd 36 5.008031338s
2019-09-24T13:13:56.251Z [ERROR] plugin/errors: 0 registry.npmjs.org. AAAA: dns: overflow unpacking uint16
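
For reference, enabling that just means adding the log plugin to the server block of the Corefile:

.:53 {
    errors
    log
    # rest of the config unchanged
}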

Finally, I also changed this directive in the Corefile:

forward . "/etc/resolv.conf"
# replaced with
forward . 1.1.1.1
# also tried with
forward . 8.8.8.8

Nothing changes.

Googling for the message “overflow unpacking uint32” turns up nothing relevant, just the code fragments where it is triggered.

My coredns deployment uses rancher/coredns-coredns:1.3.1 as its image.

What could it be?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (7 by maintainers)

Most upvoted comments

Commenting in case somebody else needs it. I’m also using a Mikrotik router.

I noticed this while troubleshooting Nextcloud News. There’s a scheduled run that makes requests to a bunch of URLs for RSS updates, and some but not all were failing. Looking at the CoreDNS logs showed many similar entries:

[ERROR] plugin/errors: 2 security.archlinux.org. AAAA: dns: overflow unpacking uint16                                       
[ERROR] plugin/errors: 2 opsgenie.status.atlassian.com. A: dns: overflow unpacking uint16                                   

On my Mikrotik, Max UDP Packet Size defaults to 4096. I’ve reduced mine to 512. No more errors in the logs, and the feeds have updated successfully.
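
For reference, on RouterOS this setting can also be changed from the CLI; roughly as below (parameter name as in RouterOS v6, worth verifying on your version):

/ip dns set max-udp-packet-size=512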

Had a similar issue, but with a different message:

[ERROR] plugin/errors: 2 production.cloudflare.docker.com. A: dns: overflowing header size

Increasing the Max UDP Packet Size in my Mikrotik from the default of 4096 to 8192 seems to have solved this.

We recently started bumping into this problem, in particular when querying for AWS S3 hostnames in the format <bucketName>.s3.us-west-1.amazonaws.com. I assume the reason is that we get back an answer with the CNAME and also a lot of A records for the CNAME, overflowing the buffer. Glad to hear a patch will be released sometime!
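
To see how large these responses actually are, dig prints the received message size in its statistics; a quick check (bucket name hypothetical):

dig +noall +stats somebucket.s3.us-west-1.amazonaws.com | grep 'MSG SIZE'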

In the meantime, (in case anyone else is arriving at this page from web searches as I did), we worked around it by configuring our k8s pods to prefer TCP DNS resolution. You just need to add the use-vc option to your dnsConfig.options like this:

spec:
  # …
  dnsConfig:
    options:
      - name: use-vc
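
With that in place, the pod’s /etc/resolv.conf should gain the option; the nameserver and search lines below are illustrative and vary per cluster:

nameserver 10.43.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5 use-vc

use-vc is the glibc resolver option that forces lookups over TCP, sidestepping UDP size limits entirely. Note it is a glibc option: musl-based images (e.g. Alpine) may ignore it.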

Why should my entire network adapt to that limit when only CoreDNS can’t handle that UDP packet size? I did that, but I don’t really see the point.

Commentary above seems to suggest that this is a problem with an upstream DNS server creating an illegal overflowed response, and upon receiving these malformed responses, CoreDNS justifiably rejects them.

Would it be nice if CoreDNS could work around these upstream bugs? Yes; a workaround patch that aims to fix this sort of issue is already merged and will be included in the next release (a date for the next release has not yet been planned).

Reducing the Max UDP packet size on my router solved the problem.

docker itself, you mean the Go code in docker? Then they are doing it (somewhat) wrong.

Yes, it should be the Go code in docker. I was able to reproduce the issue outside Kubernetes and CoreDNS.

I get a similar issue by running a plain docker pull hello-world command on the same machine where coredns is running, and also on another machine in the same network. Here’s the output:

docker pull hello-world
Error while pulling image: Get https://index.docker.io/v1/repositories/library/hello-world/images: dial tcp: lookup index.docker.io on 8.8.4.4:53: cannot unmarshal DNS message

If I do a dig/curl/nslookup for index.docker.io everything works, so it must be something inside the Go code that causes this issue.
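
For anyone who wants to reproduce this outside docker as well: the error comes from Go’s pure-Go DNS client, so a minimal sketch like the one below (forcing PreferGo so the cgo/glibc resolver is not used) should trigger the same “cannot unmarshal DNS message” on an affected network:

package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	// Force the pure-Go resolver, the same DNS client the docker error comes from.
	r := &net.Resolver{PreferGo: true}
	addrs, err := r.LookupHost(context.Background(), "index.docker.io")
	if err != nil {
		fmt.Println("lookup failed:", err) // e.g. "cannot unmarshal DNS message"
		return
	}
	fmt.Println(addrs)
}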

The upstream sending back this response is not compliant with the DNS standard. It should not send back such a response.

This is very unlikely: I changed the upstream server to both 8.8.8.8 (Google) and 1.1.1.1 (Cloudflare) and the issue persists. It would also be strange for Google’s DNS to send a malformed response.

I also tried another approach: I installed dnsmasq locally using 8.8.8.8 as the upstream server, then updated the local /etc/resolv.conf with nameserver 127.0.0.1, and everything worked: I’m able to pull images from registries and docker works.
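
A dnsmasq setup like that can be sketched as follows (standard flags, not necessarily the exact invocation used):

dnsmasq --no-resolv --server=8.8.8.8 --listen-address=127.0.0.1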

I’m starting to think that the issue could be in the LAN itself: the main gateway (a Mikrotik running RouterOS) could have some issues with EDNS. I found these resources on the web (even if they are pretty old), so I’ll point my investigation in that direction.

https://www.dns-oarc.net/oarc/services/replysizetest
https://forum.mikrotik.com/viewtopic.php?t=46227

You can maybe work around it by setting the force_tcp option in the forward stanza. tl;dr: various pieces seem to violate the DNS standard. I think coredns is doing the right thing here, and I’m reluctant to add (non-standard) workarounds.
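
For anyone landing here from a search, that looks roughly like this in the Corefile (force_tcp is a documented option of the forward plugin):

forward . /etc/resolv.conf {
    force_tcp
}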

My attempts clearly show that coredns is not involved in this issue, so I think this can be closed. I’ll use force_tcp as a temporary workaround until I find the actual issue with my setup.

Thanks for your support anyway.

P.S. In order to continue my investigation, since I’m not an expert in the DNS protocol, could you please tell me how I can distinguish EDNS requests and responses in PCAP dumps?
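
(For reference: EDNS is signaled by an OPT pseudo-record, type 41, in the additional section of a DNS message. Two ways to spot it, sketched below; the tshark filter syntax is worth double-checking:)

# Wireshark/tshark: show only DNS packets that carry an OPT record (type 41)
tshark -r dump.pcap -Y "dns.resp.type == 41"

# Compare on the wire with dig: the first sends an OPT record, the second does not
dig +edns=0 example.com
dig +noedns example.com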