coredns: plugin/kubernetes: Intermittent NXDOMAIN for headless service
After trying to track down a random NXDOMAIN response I came across in my cluster, I’ve noticed a pattern: repeated DNS lookups for a headless service succeed with NOERROR for a while, then start failing with NXDOMAIN for a stretch that coincides with the resyncperiod config setting, and then go back to succeeding with NOERROR.
I can’t figure out what triggers the initial NXDOMAIN response, but the failure window always lasts exactly the resyncperiod duration (changing this config value changes the error duration), and it only happens with the headless services on my cluster.
Any thoughts on what could explain this behavior?
More info:
Corefile:
.:53 {
    errors
    log stdout
    health
    kubernetes cluster.local 10.3.0.0/24 {
        resyncperiod 3m
    }
    proxy . /etc/resolv.conf
    cache 30
}
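Since the failure window seems to track resyncperiod, one way to pin down the exact transitions is to poll the response status and timestamp each sample. A minimal sketch, assuming $COREDNS_POD_NAME is set to the CoreDNS pod name as in the dig commands further down:
# Sample the response code every 10s so the NOERROR -> NXDOMAIN flips can be
# lined up against the 3m resyncperiod.
while true; do
  status=$(kubectl --namespace kube-system exec "$COREDNS_POD_NAME" -- \
    dig @localhost AAAA rabbitmq.default.svc.cluster.local +noall +comments \
    | sed -n 's/.*status: \([A-Z]*\).*/\1/p')
  echo "$(date -u '+%H:%M:%S') $status"
  sleep 10
done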
I’ve got this service definition:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: rabbitmq-non-ha-insipid-grizzly
  name: rabbitmq
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: ampq
    port: 5672
    protocol: TCP
    targetPort: ampq
  - name: rabbitmq-dist
    port: 25672
    protocol: TCP
    targetPort: rabbitmq-dist
  - name: epmd
    port: 4369
    protocol: TCP
    targetPort: epmd
  - name: metrics
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: rabbitmq-non-ha-insipid-grizzly
  type: ClusterIP
Describe shows:
$ kubectl describe svc rabbitmq
Name: rabbitmq
Namespace: default
Selector: app=rabbitmq-non-ha-insipid-grizzly
Type: ClusterIP
IP: None
Port: ampq 5672/TCP
Endpoints:
Port: rabbitmq-dist 25672/TCP
Endpoints:
Port: epmd 4369/TCP
Endpoints: 10.2.78.4:4369
Port: metrics 9090/TCP
Endpoints: 10.2.78.4:9090
Session Affinity: None
No events.
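Since this is a headless service (clusterIP: None), its records should come from the endpoint addresses rather than a cluster IP, so one thing worth checking is whether the Endpoints object itself ever goes empty around the time of the flip. A rough watch loop (just a sketch, using the same service as above):
# Timestamp plus whatever endpoint IPs the apiserver currently has for the service;
# run this while the NXDOMAIN window is open to see if the endpoints disappear.
while true; do
  echo "$(date -u '+%H:%M:%S') $(kubectl get endpoints rabbitmq -o jsonpath='{.subsets[*].addresses[*].ip}')"
  sleep 10
done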
Dig commands succeed for a while, like this:
$ kubectl --namespace kube-system exec -it $COREDNS_POD_NAME dig @localhost AAAA rabbitmq.default.svc.cluster.local
; <<>> DiG 9.11.1-P1 <<>> @localhost AAAA rabbitmq.default.svc.cluster.local
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14909
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 811b59a8ff3fdf27 (echoed)
;; QUESTION SECTION:
;rabbitmq.default.svc.cluster.local. IN AAAA
;; AUTHORITY SECTION:
cluster.local. 30 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1515440427 7200 1800 86400 60
;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Mon Jan 08 19:40:27 UTC 2018
;; MSG SIZE rcvd: 129
with coredns logs showing:
10.2.1.7 - [08/Jan/2018:19:40:27 +0000] "AAAA IN rabbitmq.default.svc.cluster.local. udp 52 false 512" NOERROR qr,aa,rd,ra 105 283.465µs
And then fail like this for a duration of resyncperiod:
$ kubectl --namespace kube-system exec -it $COREDNS_POD_NAME dig @localhost AAAA rabbitmq.default.svc.cluster.local
; <<>> DiG 9.11.1-P1 <<>> @localhost AAAA rabbitmq.default.svc.cluster.local
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 45919
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: b2883d6384c54c62 (echoed)
;; QUESTION SECTION:
;rabbitmq.default.svc.cluster.local. IN AAAA
;; AUTHORITY SECTION:
cluster.local. 30 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1515440607 7200 1800 86400 60
;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Mon Jan 08 19:43:27 UTC 2018
;; MSG SIZE rcvd: 129
with coredns logs showing:
10.2.1.7 - [08/Jan/2018:19:43:27 +0000] "AAAA IN rabbitmq.default.svc.cluster.local. udp 52 false 512" NXDOMAIN qr,aa,rd,ra 105 346.36µs
I also see the same thing for type A lookups.
This is on image coredns/coredns:0.9.9 and k8s server version v1.8.1+coreos.0.
edit: accidentally had the 2 coredns log snippets swapped
We could add a long-run integration test to coredns/ci, and add a separate command for it, such as “/long-run-test”.
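A rough shell sketch of what such a long-run check could do (this is not the actual coredns/ci harness; SERVER and NAME are placeholders):
# Hypothetical long-run check: query the record for longer than resyncperiod
# and fail on the first NXDOMAIN.
SERVER=10.3.0.10
NAME=rabbitmq.default.svc.cluster.local
for i in $(seq 1 60); do   # 60 x 10s = 10 minutes, well past a 3m resyncperiod
  if dig @"$SERVER" A "$NAME" +noall +comments | grep -q 'status: NXDOMAIN'; then
    echo "got NXDOMAIN after $((i * 10))s" >&2
    exit 1
  fi
  sleep 10
done
echo "no NXDOMAIN observed"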
By “cache” in this context, I mean the k8s API cache. I don’t think the DNS cache has any involvement here.
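One way to confirm the DNS cache isn’t involved would be to run with the cache plugin removed and see whether the NXDOMAIN window still tracks resyncperiod. A variant of the Corefile above, with nothing else changed:
.:53 {
    errors
    log stdout
    health
    kubernetes cluster.local 10.3.0.0/24 {
        resyncperiod 3m
    }
    proxy . /etc/resolv.conf
    # cache 30 removed while testing, so every query hits the kubernetes plugin
}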