coredns: plugin/kubernetes: Intermittent NXDOMAIN for headless service

While trying to track down a seemingly random NXDOMAIN response in my cluster, I’ve noticed a pattern.

Repeated DNS lookups for a headless service succeed with NOERROR for a while, then start failing with NXDOMAIN for a period that matches the resyncperiod config setting, and then go back to succeeding with NOERROR.

I can’t figure out what triggers the initial NXDOMAIN response, but the failure window always lasts exactly the resyncperiod duration (changing that config value changes the length of the error window), and it only happens for the headless services in my cluster.

Any thoughts on what could explain this behavior?
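
For anyone who wants to reproduce the timing, a small loop like the following is enough to watch the rcode flip and measure how long each state lasts. This is only a rough sketch using github.com/miekg/dns; the target address and query name are just what applies to my cluster and would need adjusting.

package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	client := new(dns.Client)
	lastRcode := -1
	var lastChange time.Time

	for {
		msg := new(dns.Msg)
		msg.SetQuestion("rabbitmq.default.svc.cluster.local.", dns.TypeA)
		// 127.0.0.1:53 assumes the loop runs inside the CoreDNS pod;
		// point it at the cluster DNS IP otherwise.
		resp, _, err := client.Exchange(msg, "127.0.0.1:53")
		if err != nil {
			fmt.Printf("%s query error: %v\n", time.Now().Format(time.RFC3339), err)
		} else if resp.Rcode != lastRcode {
			now := time.Now()
			if lastChange.IsZero() {
				fmt.Printf("%s initial rcode %s\n", now.Format(time.RFC3339), dns.RcodeToString[resp.Rcode])
			} else {
				// Prints how long the previous rcode persisted, so the
				// NXDOMAIN window can be compared against resyncperiod.
				fmt.Printf("%s rcode changed to %s after %s\n",
					now.Format(time.RFC3339), dns.RcodeToString[resp.Rcode], now.Sub(lastChange))
			}
			lastRcode = resp.Rcode
			lastChange = now
		}
		time.Sleep(time.Second)
	}
}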


More info:

Corefile:

.:53 {
    errors
    log stdout
    health
    kubernetes cluster.local 10.3.0.0/24 {
        resyncperiod 3m
    }
    proxy . /etc/resolv.conf
    cache 30
}
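
If I understand the kubernetes plugin correctly, it watches the API through client-go informers and resyncperiod is handed to them as the informer resync interval, which would explain why a bad entry only gets corrected on the next resync. Below is a minimal sketch of that wiring, not the plugin’s actual code, just the generic client-go pattern with an illustrative handler body.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The 3m here plays the same role as "resyncperiod 3m" in the Corefile:
	// cached Endpoints objects are replayed to the handlers on this interval,
	// which is the only time a missed or stale entry would get corrected.
	factory := informers.NewSharedInformerFactory(client, 3*time.Minute)
	endpoints := factory.Core().V1().Endpoints().Informer()
	endpoints.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			fmt.Println("endpoints updated or resynced")
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, endpoints.HasSynced)
	select {}
}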

I’ve got this service definition:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: rabbitmq-non-ha-insipid-grizzly
  name: rabbitmq
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: ampq
    port: 5672
    protocol: TCP
    targetPort: ampq
  - name: rabbitmq-dist
    port: 25672
    protocol: TCP
    targetPort: rabbitmq-dist
  - name: epmd
    port: 4369
    protocol: TCP
    targetPort: epmd
  - name: metrics
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: rabbitmq-non-ha-insipid-grizzly
  type: ClusterIP

Describe shows:

$ kubectl describe svc rabbitmq
Name:			rabbitmq
Namespace:		default
Selector:		app=rabbitmq-non-ha-insipid-grizzly
Type:			ClusterIP
IP:			None
Port:			ampq	5672/TCP
Endpoints:
Port:			rabbitmq-dist	25672/TCP
Endpoints:
Port:			epmd	4369/TCP
Endpoints:		10.2.78.4:4369
Port:			metrics	9090/TCP
Endpoints:		10.2.78.4:9090
Session Affinity:	None
No events.

The dig command succeeds for a while, like this:

$ kubectl --namespace kube-system exec -it $COREDNS_POD_NAME dig @localhost AAAA rabbitmq.default.svc.cluster.local
; <<>> DiG 9.11.1-P1 <<>> @localhost AAAA rabbitmq.default.svc.cluster.local
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14909
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 811b59a8ff3fdf27 (echoed)
;; QUESTION SECTION:
;rabbitmq.default.svc.cluster.local. IN	AAAA

;; AUTHORITY SECTION:
cluster.local.		30	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1515440427 7200 1800 86400 60

;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Mon Jan 08 19:40:27 UTC 2018
;; MSG SIZE  rcvd: 129

with coredns logs showing:

10.2.1.7 - [08/Jan/2018:19:40:27 +0000] "AAAA IN rabbitmq.default.svc.cluster.local. udp 52 false 512" NOERROR qr,aa,rd,ra 105 283.465µs

And then fails like this for the duration of resyncperiod:

$ kubectl --namespace kube-system exec -it $COREDNS_POD_NAME dig @localhost AAAA rabbitmq.default.svc.cluster.local
; <<>> DiG 9.11.1-P1 <<>> @localhost AAAA rabbitmq.default.svc.cluster.local
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 45919
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: b2883d6384c54c62 (echoed)
;; QUESTION SECTION:
;rabbitmq.default.svc.cluster.local. IN	AAAA

;; AUTHORITY SECTION:
cluster.local.		30	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1515440607 7200 1800 86400 60

;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Mon Jan 08 19:43:27 UTC 2018
;; MSG SIZE  rcvd: 129

with coredns logs showing:

10.2.1.7 - [08/Jan/2018:19:43:27 +0000] "AAAA IN rabbitmq.default.svc.cluster.local. udp 52 false 512" NXDOMAIN qr,aa,rd,ra 105 346.36µs

I also see the same thing for type A lookups.

This is on image coredns/coredns:0.9.9 with Kubernetes server version v1.8.1+coreos.0.


edit: I accidentally had the two CoreDNS log snippets swapped.

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 38 (29 by maintainers)

Most upvoted comments

We could add a long-running integration test to coredns/ci, and add a separate command for it, such as “/long-run-test”.
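
Something along these lines could work as that test. This is only a rough sketch; the package name, test name, and DNS address are hypothetical, not the actual coredns/ci harness.

package longrun

import (
	"testing"
	"time"

	"github.com/miekg/dns"
)

// TestHeadlessServiceStaysResolvable keeps querying a headless service
// record for long enough to span several resync periods and fails if the
// response ever flips to NXDOMAIN.
func TestHeadlessServiceStaysResolvable(t *testing.T) {
	client := new(dns.Client)
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		msg := new(dns.Msg)
		msg.SetQuestion("rabbitmq.default.svc.cluster.local.", dns.TypeA)
		resp, _, err := client.Exchange(msg, "10.3.0.10:53") // cluster DNS IP, adjust per cluster
		if err != nil {
			t.Fatalf("query failed: %v", err)
		}
		if resp.Rcode == dns.RcodeNameError {
			t.Fatalf("got NXDOMAIN at %s", time.Now().Format(time.RFC3339))
		}
		time.Sleep(5 * time.Second)
	}
}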

By “cache” in this context, I mean the Kubernetes API cache. I don’t think the DNS cache has any involvement here.