coredns: Problems with latest 1.8.0

Hello folks, last week we updated our prod environments from 1.6.9 to 1.8.0 and unfortunately had to roll back. I tried to debug the issue, but since it was prod I didn’t have enough time to get to the bottom of it. Here are the symptoms:

  • DNS queries (type A) for names with a high number of endpoints (>200) took ~35 ms and very often failed, with CoreDNS logging the following errors:
[ERROR] plugin/errors: 2 <MyDNS>. A: dns: overflow unpacking uint16
[ERROR] plugin/errors: 2 <MyOtherDNS>. A: dns: overflow unpacking uint32
  • 7x the number of DNS requests to the upstream DNS server (screenshot: Screen Shot 2020-12-18 at 9 29 06 AM)

  • DNS over TCP requests (screenshot: Screen Shot 2020-12-18 at 9 44 46 AM)

I took a tcpdump, and although it is inconclusive, it is clear that each request is retransmitted (screenshot: Screen Shot 2020-12-18 at 9 48 24 AM).

This is the CoreDNS config:

  Corefile: |-
    .:53 {
        bind 192.168.0.1
        errors
        health :8081 {
          lameduck 5s
        }
        kubernetes cluster.local. in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        cache 30
        loadbalance
        reload
    }
    my-internal.domain {
      bind 192.168.0.1
      errors
      cache 30 {
        prefetch 2 1m 20%
      }
      forward . 127.0.0.1:8600 172.30.234.20:8600 {
        policy sequential
      }
    }

I looked at the changes between 1.6.9 and 1.8.0 and noticed some changes/fixes to EDNS0; that’s my main suspect at the moment. I tried to reproduce the issue in staging without luck, so QPS seems to be another factor.
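For context on why EDNS0 buffer sizes come up at all with >200 endpoints, here is a rough back-of-the-envelope sketch of how large such an A response gets on the wire. It assumes name compression (each answer's owner name is a 2-byte pointer), an OPT record with no options, and a hypothetical name `svc.my-internal.domain`; none of this comes from the actual capture, so treat the numbers as an estimate only.

```python
# Rough wire-size estimate for a DNS response carrying many A records.
# Assumption: every answer's owner name is compressed to a 2-byte pointer.

HEADER_BYTES = 12   # fixed DNS header
OPT_RR_BYTES = 11   # EDNS0 OPT pseudo-record: root name (1) + type, class, ttl, rdlength (10)


def question_bytes(qname: str) -> int:
    """Wire size of the question section for a single name."""
    labels = [label for label in qname.rstrip(".").split(".") if label]
    name_len = sum(1 + len(label) for label in labels) + 1  # length octets + root octet
    return name_len + 4  # QTYPE + QCLASS


def a_response_bytes(qname: str, answers: int) -> int:
    """Approximate wire size of a response with `answers` A records."""
    per_record = 2 + 2 + 2 + 4 + 2 + 4  # pointer, type, class, ttl, rdlength, IPv4 address
    return HEADER_BYTES + question_bytes(qname) + answers * per_record + OPT_RR_BYTES


if __name__ == "__main__":
    size = a_response_bytes("svc.my-internal.domain", 200)  # hypothetical name
    for limit in (512, 2048, 4096):
        verdict = "fits" if size <= limit else "does not fit"
        print(f"{size} bytes {verdict} in a {limit}-byte UDP payload")
```

Under these assumptions a 200-record answer is a little over 3 KB, so it would fit a 4096-byte advertised buffer but not a 2048-byte one, which is one way responses could end up truncated and retried over TCP.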

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 43 (21 by maintainers)

Most upvoted comments

[ Quoting notifications@github.com in “Re: [coredns/coredns] Problems with…” ]

@chrisohaver from what I understood, I think the problem is that CoreDNS, when forwarding the request to Consul, doesn’t respect the UDP OPT size of 4096 received from the client and overwrites it with 2048.

that is very likely not the issue
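
Whether or not the buffer size turns out to matter, the claim is easy to check against a capture. Below is a minimal sketch, using only the Python standard library, that builds a query advertising a given EDNS0 UDP payload size and reads that size back out of a raw query message. The helper names `build_query` and `advertised_udp_size` are made up for illustration, and the parser only handles plain queries (no answer or authority records, OPT owner name is the root), so it is a starting point rather than a general DNS parser.

```python
import struct

OPT_TYPE = 41  # EDNS0 OPT pseudo-record type (RFC 6891)


def build_query(qname: str, udp_size: int, qid: int = 0x1234) -> bytes:
    """Build an A query for `qname` with an OPT record advertising `udp_size`."""
    # Flags 0x0100 = recursion desired; 1 question, 0 answers, 0 authority, 1 additional.
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 1)
    labels = qname.rstrip(".").split(".")
    name = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = name + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    # OPT: root name, type 41, CLASS field carries the UDP payload size, ttl 0, rdlength 0.
    opt = b"\x00" + struct.pack("!HHIH", OPT_TYPE, udp_size, 0, 0)
    return header + question + opt


def advertised_udp_size(msg: bytes):
    """Return the UDP payload size advertised in a query's OPT record, or None."""
    qdcount, ancount, nscount, arcount = struct.unpack("!HHHH", msg[4:12])
    if ancount or nscount:
        raise ValueError("sketch only handles queries without answer/authority records")
    offset = 12
    for _ in range(qdcount):
        while msg[offset] != 0:          # walk the uncompressed labels
            offset += 1 + msg[offset]
        offset += 1 + 4                  # root octet + QTYPE + QCLASS
    for _ in range(arcount):
        if msg[offset] != 0:
            raise ValueError("sketch only handles additional records owned by the root")
        rtype, rclass, _ttl, rdlen = struct.unpack("!HHIH", msg[offset + 1:offset + 11])
        if rtype == OPT_TYPE:
            return rclass                # for OPT, CLASS is the requester's UDP payload size
        offset += 11 + rdlen
    return None


if __name__ == "__main__":
    query = build_query("example.org", 4096)
    print(advertised_udp_size(query))
```

Feeding it the DNS payload of the client-to-CoreDNS query and of the CoreDNS-to-upstream query from the same tcpdump would show whether the advertised size actually changes between the two hops.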