coredns: Forward plugin sometimes returns FORMERR

Hi there,

We’re using CoreDNS in our Kubernetes clusters as the resolver, with a fairly basic setup:

    . {
      errors
      health
      prometheus 0.0.0.0:9153
      log . "{remote}:{port} - [{when}] {>id} {type} {class} {name} {proto} {size} {>do} {>bufsize} {rcode} {>rflags} {rsize} {duration}" {
        class error
      }
      cache 30
      kubernetes tt-k8s1.ko.seznam.cz 172.16.0.0/12 {
        tls /coredns/tls/coredns.pem /coredns/tls/coredns-key.pem /coredns/tls/ca.pem
        upstream 10.250.20.222 10.250.50.222
      }
      forward . 10.250.20.222 10.250.50.222
    }

Here 10.250.20.222 and 10.250.50.222 are our internal DNS servers, which handle recursive resolution for everything else: internal domains as well as all outgoing queries.
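
As a sanity check, the upstreams can also be queried directly with EDNS enabled (dig includes an OPT record by default; `+edns=0` just makes it explicit). The domain here is one of the failing names from the logs below:

    $ dig +edns=0 www.mapy.cz @10.250.20.222
    $ dig +edns=0 www.mapy.cz @10.250.50.222

Both should come back NOERROR if the upstreams handle EDNS queries correctly.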

We’re running CoreDNS 1.1.2 at the moment.

Every now and then, FORMERR responses start showing up; they keep appearing for a few minutes and then disappear, always within 10 minutes. They look like this in the log:

10.64.38.104:59837 - [27/Jun/2018:12:13:14 +0200] 56625 A IN www.mapy.cz. udp 29 false 512 FORMERR qr,rd 29 205.372µs
10.66.206.49:37658 - [27/Jun/2018:12:12:51 +0200] 31387 AAAA IN grafana.com. udp 29 false 512 FORMERR qr,rd 29 219.374µs

Judging by the µs latencies, the requests seem to be served from cache - if they actually went to the upstream DNS servers, each query would take at least a millisecond.
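
Since the config enables the prometheus plugin on :9153, one way to check the cache theory is to watch the cache counters on the affected instance while the errors are happening - something like this, assuming the cache plugin's usual metric names and using the instance IP from the dig below:

    $ curl -s http://10.65.142.169:9153/metrics | grep -E 'coredns_cache_(hits|misses)_total'

If the hit counters climb while the FORMERRs are being logged, the broken responses are indeed coming out of the cache.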

I managed to do a raw dig for a failing domain when the issue was happening:

$ dig puzzle.ng.seznam.cz @10.65.142.169

; <<>> DiG 9.10.3-P4-Ubuntu <<>> puzzle.ng.seznam.cz @10.65.142.169
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46786
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; WARNING: EDNS query returned status FORMERR - retry with '+noedns'

;; QUESTION SECTION:
;puzzle.ng.seznam.cz.		IN	A

;; Query time: 1 msec
;; SERVER: 10.65.142.169#53(10.65.142.169)
;; WHEN: Wed Jun 27 12:12:56 CEST 2018
;; MSG SIZE  rcvd: 37

and I tried a subsequent query with +noedns:

$ dig +noedns puzzle.ng.seznam.cz @10.65.142.169

; <<>> DiG 9.10.3-P4-Ubuntu <<>> +noedns puzzle.ng.seznam.cz @10.65.142.169
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 47898
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;puzzle.ng.seznam.cz.		IN	A

;; Query time: 1 msec
;; SERVER: 10.65.142.169#53(10.65.142.169)
;; WHEN: Wed Jun 27 12:13:05 CEST 2018
;; MSG SIZE  rcvd: 37
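
Because the window is so short, a small probe loop against the affected instance, sending both EDNS and non-EDNS queries and logging the status, can pin down exactly when the FORMERRs start and stop. A sketch, using the same name and instance IP as above:

    # query with and without EDNS once per second and log the rcode
    while true; do
      for opt in "" "+noedns"; do
        status=$(dig $opt +time=2 +tries=1 puzzle.ng.seznam.cz @10.65.142.169 \
          | awk '/status:/ { gsub(",", "", $6); print $6 }')
        echo "$(date -Is) ${opt:-+edns} $status"
      done
      sleep 1
    done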

I’m 100% certain the upstream DNS resolvers were working correctly at the time: I issued the same query against them directly and they returned the correct answer.

I also don’t see anything else in the log, just the FORMERR errors. I would expect some upstream connection errors if CoreDNS considered the upstream servers unavailable.

I’ve only ever seen this happen to a single CoreDNS instance in the cluster at a time.

It seems to me that the affected CoreDNS instance thinks the upstream DNS servers are unavailable, when in reality they are reachable. I’d appreciate any help or advice 😃 Thank you

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 53 (34 by maintainers)

Most upvoted comments

hello again,

my last followup, as I’m leaving the company tomorrow: since upgrading to 1.2.0, the issue has never manifested again.

my colleagues are monitoring the issue and if it resurfaces, they should report here.

we’re also in the process of migrating our upstream DNS servers to other software for unrelated reasons, so if our upstream DNS was causing this, it should disappear soon as well.

thank you everyone for your help.


Let me wait until Monday to see whether the issue gets magically fixed somehow, so we know it still manifests after the upgrade.

Regarding logging which upstream sent the reply - it really doesn’t matter in our case, as they all use the same IP addresses.

See https://github.com/coredns/coredns/pull/1973

So inconclusive local testing seems to show an improvement.

I wonder how I can properly unit test this

I assume that [1.2.0] contains #1902 as well?

yes

As said previously, a tcpdump would help; you can potentially patch in #1902.
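
For example, a capture limited to the upstream traffic, run on the node hosting the affected instance (upstream IPs from the Corefile at the top), would look roughly like:

    # capture DNS traffic to/from the two upstreams for later inspection
    tcpdump -n -i any -w coredns-upstreams.pcap \
      'port 53 and (host 10.250.20.222 or host 10.250.50.222)'

Comparing the on-the-wire replies with what the forward plugin logs should show whether the FORMERR originates upstream or inside CoreDNS.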