coredns: Forward plugin sometimes returns FORMERR
Hi there,
We’re using CoreDNS as the resolver in our Kubernetes clusters. We have a rather basic setup:
. {
    errors
    health
    prometheus 0.0.0.0:9153
    log . "{remote}:{port} - [{when}] {>id} {type} {class} {name} {proto} {size} {>do} {>bufsize} {rcode} {>rflags} {rsize} {duration}" {
        class error
    }
    cache 30
    kubernetes tt-k8s1.ko.seznam.cz 172.16.0.0/12 {
        tls /coredns/tls/coredns.pem /coredns/tls/coredns-key.pem /coredns/tls/ca.pem
        upstream 10.250.20.222 10.250.50.222
    }
    forward . 10.250.20.222 10.250.50.222
}
with 10.250.20.222 and 10.250.50.222 being our internal DNS servers handling recursive DNS queries for the rest of the domains, including internal domains as well as all outgoing queries.
We’re using coredns 1.1.2 at the moment.
Occasionally, FORMERR responses keep showing up for a few minutes and then disappear again, always within 10 minutes. They look like this in the log:
10.64.38.104:59837 - [27/Jun/2018:12:13:14 +0200] 56625 A IN www.mapy.cz. udp 29 false 512 FORMERR qr,rd 29 205.372µs
10.66.206.49:37658 - [27/Jun/2018:12:12:51 +0200] 31387 AAAA IN grafana.com. udp 29 false 512 FORMERR qr,rd 29 219.374µs
Given the microsecond latencies, it seems that these responses are served from the cache or generated locally: if CoreDNS had actually contacted the upstream DNS servers, each query would have taken at least a millisecond.
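The latency argument above can be checked over a larger log sample instead of by eye. Below is a minimal sketch, assuming every log line follows the format string from the Corefile above; the function names and the 1 ms threshold (taken from the rule of thumb in this post) are illustrative assumptions, not anything CoreDNS provides.

```python
import re

# One regex group per placeholder in the Corefile log format:
# {remote}:{port} - [{when}] {>id} {type} {class} {name} {proto} {size}
# {>do} {>bufsize} {rcode} {>rflags} {rsize} {duration}
LOG_LINE = re.compile(
    r"^(?P<remote>\S+):(?P<port>\d+) - \[(?P<when>[^\]]+)\] "
    r"(?P<id>\d+) (?P<type>\S+) (?P<class>\S+) (?P<name>\S+) (?P<proto>\S+) "
    r"(?P<size>\d+) (?P<do>\S+) (?P<bufsize>\d+) (?P<rcode>\S+) "
    r"(?P<rflags>\S+) (?P<rsize>\d+) (?P<duration>\S+)$"
)

# Single-unit Go-style durations such as "205.372µs" or "1.2ms".
# Both the micro sign (U+00B5) and the Greek mu (U+03BC) are accepted.
_UNITS = {"ns": 1e-9, "µs": 1e-6, "μs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ns|µs|μs|us|ms|s)$")


def parse_duration(text):
    """Return the duration in seconds, or None if the string is not understood."""
    match = _DURATION.match(text)
    if match is None:
        return None
    return float(match.group(1)) * _UNITS[match.group(2)]


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["duration"] = parse_duration(record["duration"])
    return record


def is_fast_formerr(record, threshold=1e-3):
    """True for FORMERR responses answered faster than `threshold` seconds.

    The threshold is an assumption: the post argues that an upstream round
    trip would take at least a millisecond.
    """
    return (
        record is not None
        and record["rcode"] == "FORMERR"
        and record["duration"] is not None
        and record["duration"] < threshold
    )
```

Running `is_fast_formerr(parse_line(line))` over each line of the log would count how many FORMERR responses came back too quickly to have involved an upstream query. It does not show where the response originated, only that it was fast.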
I managed to do a raw dig for a failing domain when the issue was happening:
$ dig puzzle.ng.seznam.cz @10.65.142.169
; <<>> DiG 9.10.3-P4-Ubuntu <<>> puzzle.ng.seznam.cz @10.65.142.169
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46786
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; WARNING: EDNS query returned status FORMERR - retry with '+noedns'
;; QUESTION SECTION:
;puzzle.ng.seznam.cz. IN A
;; Query time: 1 msec
;; SERVER: 10.65.142.169#53(10.65.142.169)
;; WHEN: Wed Jun 27 12:12:56 CEST 2018
;; MSG SIZE rcvd: 37
and I tried a subsequent query with +noedns:
$ dig +noedns puzzle.ng.seznam.cz @10.65.142.169
; <<>> DiG 9.10.3-P4-Ubuntu <<>> +noedns puzzle.ng.seznam.cz @10.65.142.169
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 47898
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;puzzle.ng.seznam.cz. IN A
;; Query time: 1 msec
;; SERVER: 10.65.142.169#53(10.65.142.169)
;; WHEN: Wed Jun 27 12:13:05 CEST 2018
;; MSG SIZE rcvd: 37
I’m 100% certain the upstream DNS resolvers were working correctly at that time, because I issued the same query against them directly and they returned the correct answer.
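The side-by-side check described above (the same query sent to the failing CoreDNS instance and to the upstream resolvers, with and without EDNS) could be scripted so it runs the moment the problem appears. This is a minimal sketch using only the Python standard library; the function names are hypothetical, it only handles plain UDP A queries, and it reads nothing but the RCODE from the response header.

```python
import socket
import struct

# A few common response codes from the DNS header (RFC 1035).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234, edns=False, bufsize=4096):
    """Build a raw DNS query (class IN) with the RD flag set.

    With edns=True an empty OPT pseudo-record (type 41) is appended,
    roughly what dig sends by default; edns=False mimics `dig +noedns`.
    """
    flags = 0x0100  # RD (recursion desired)
    arcount = 1 if edns else 0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, arcount)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)
    opt = b""
    if edns:
        # Root owner name, TYPE=OPT(41), CLASS=UDP payload size, TTL=0, RDLEN=0.
        opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, 0, 0)
    return header + question + opt


def parse_rcode(response):
    """Return the RCODE of a raw DNS response as a string."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    flags = struct.unpack("!HHHHHH", response[:12])[1]
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(server, name, edns=False, timeout=2.0):
    """Send one UDP query to `server` on port 53 and return the RCODE string."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, edns=edns), (server, 53))
        data, _ = sock.recvfrom(4096)
    return parse_rcode(data)
```

For example, calling `ask("10.65.142.169", "puzzle.ng.seznam.cz")` and `ask("10.250.20.222", "puzzle.ng.seznam.cz")` back to back, once with `edns=False` and once with `edns=True`, would record all four results at the same moment. As a side note, a non-EDNS query for `puzzle.ng.seznam.cz` built this way is 37 bytes long, the same as the `MSG SIZE rcvd: 37` in both dig outputs, which is consistent with the FORMERR responses containing only the header and the question.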
I also don’t see anything else in the log, just the FORMERR errors. I would expect to see some upstream connection errors if coredns considered the upstream servers to be unavailable.
I’ve only ever seen this happen to a single coredns instance in the cluster at a time.
It seems to me that the coredns instance thinks that the upstream DNS servers are unavailable, when in reality they are available. I’d appreciate any help or advice 😃 Thank you
About this issue
- State: closed
- Created 6 years ago
- Comments: 53 (34 by maintainers)
hello again,
my last follow-up, as I’m leaving the company tomorrow: since we started running 1.2.0, the issue has not manifested again.
my colleagues are monitoring this, and if it resurfaces, they should report it here.
we’re also in the process of migrating our upstream DNS servers to other software for unrelated reasons, so if our upstream DNS was causing these issues, they should also disappear soon.
thank you everyone for your help.
See https://github.com/coredns/coredns/pull/1973
So non-conclusive local testing seems to show an improvement.
I wonder how I can properly unit test this
yes
As said previously, a tcpdump would help; you can potentially patch in #1902