coredns: TC flag not set for ExternalName service when the payload is large

As explained in this article https://easoncao.com/coredns-resolution-truncation-issue-on-kubernetes-kube-dns/

Ideal behavior - the response should contain not just the CNAME but also all IPs that belong to the CNAME target. If the server (CoreDNS) cannot fit all the IPs within the 512-byte UDP payload limit, it should send a partial response with the Truncated (TC) flag set. Once the client receives this truncated response, it retries the same DNS query over TCP.
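
For illustration, the expected exchange can be reproduced with dig against any name whose answer overflows 512 bytes (the name and the 10.100.0.10 cluster DNS IP below are placeholders, not taken from this issue):

  # UDP query without EDNS0; +ignore tells dig to print the truncated reply instead of retrying
  dig +noedns +ignore A big-record.example.com @10.100.0.10
  # a correct server answers with the tc flag set in the header and a partial answer section;
  # a client that honors tc then repeats the same query over TCP
  dig +tcp A big-record.example.com @10.100.0.10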

Since CoreDNS fails to set the truncate flag, and since the AA (Authoritative Answer) flag is set, clients neither upgrade the request to TCP nor fall back to a recursive strategy of sending a follow-up "A" query for the CNAME target. Setting AA is arguably acceptable, because CoreDNS has a dual responsibility (authoritative server for local Kubernetes services, recursive resolver for external internet domains).

The only concern here is that CoreDNS responds with just the CNAME without even setting the truncate (TC) flag.

Tested potential workarounds:

  1. Force the clients (pods) to use TCP only when talking to the CoreDNS pods. This can be configured with the use-vc resolver option in the pod spec's dnsConfig, as in the fragment below (a verification sketch follows this list):
  template:
    metadata:
      labels:
        app: customer-dns
    spec:
      dnsConfig:
        options:
          - name: use-vc
      containers:
        - name: customer-dns
          image: nginx:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 300m
              memory: 500Mi
            requests:
              cpu: 200m
              memory: 500Mi
  2. As a win-win, the customer could use NodeLocal DNSCache: the communication from the node-local DNS cache to the CoreDNS pods is TCP by default, while the communication from the pod to the node-local DNS cache can be either TCP or UDP.
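
To confirm the resolver option from workaround 1 actually landed in the pod, the option should show up in resolv.conf (a minimal sketch; it assumes the Deployment is named customer-dns, which the fragment above does not show):

  kubectl exec deploy/customer-dns -- cat /etc/resolv.conf
  # the options line is expected to include use-vc, e.g. "options ndots:5 use-vc"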

As per my testing, implementing NodeLocal DNSCache solved the issue; more precisely, the issue is masked because the node-local cache path does not exhibit the unexpected behavior.
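
This lines up with the stock node-local-dns configuration, whose Corefile forwards cluster.local queries to the cluster DNS over TCP (a trimmed excerpt of the upstream node-local-dns template; the force_tcp line is what masks the issue here):

  cluster.local:53 {
      cache {
          success 9984 30
          denial 9984 5
      }
      forward . __PILLAR__CLUSTER__DNS__ {
          force_tcp
      }
  }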

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 28 (21 by maintainers)

Most upvoted comments

@TheRealGoku, another, simpler workaround would be to add force_tcp to the forward plugin (it should work for the same reason node-local-dns works: the backend resolution is done over TCP).

  forward . /etc/resolv.conf {
    force_tcp
  }
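
On EKS this amounts to editing the coredns ConfigMap and adding force_tcp inside the existing forward block; the reload plugin visible in the logs later in this issue should pick the change up shortly afterwards (sketch):

  kubectl -n kube-system edit configmap coredns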

FYI, I believe I’ve discovered the root cause of the inconsistent TC bit I mentioned above. In a nutshell, when we do an upstream lookup and that result was truncated, the answer to the client needs to have the TC bit set. It’s an easy fix for template, less easy for backend users (kubernetes and etcd), and kind of a horrible nightmare for file and friends. The inconsistency stems from whether or not the total response of CNAME + truncated A records exceeds the max length. In some cases it will, in which case the response gets truncated and marked truncated. In other cases it will not, in which case the response is not truncated and not marked truncated (however the set of A records is in fact truncated).

I’m working on a fix and should open a PR early next week.

I can confirm that it's the same behavior for ExternalName in CoreDNS-1.8.3 on EKS 1.20; the issue still persists. I can also conclude that this issue does not happen for direct queries to a domain with a large reply payload, but only for an ExternalName type service.

Environment
Kubernetes version - EKS 1.20
CoreDNS version - CoreDNS-1.8.3

Steps:

  1. To make sure we can control both a small and a large DNS reply payload, I created a private hosted zone (playwithtc.com) with a record set for lesspayload.playwithtc.com that returns a small reply and one for morepayload.playwithtc.com with enough A records that the reply exceeds 512 bytes, and attached it to the same VPC as the EKS cluster. (Record-set screenshots omitted.)

  2. Now create ExternalName type services referring to the above names; in a real-life scenario these could be the domains of an elastically scaled Redis cluster.

     kubectl apply -f lesspayload.yaml
     kubectl apply -f morepayload.yaml
     kubectl apply -f nginx-deployment.yaml

File - lesspayload.yaml

apiVersion: v1
kind: Service
metadata:
  name: lesspayload
  namespace: default
spec:
  externalName: lesspayload.playwithtc.com
  type: ExternalName

File - morepayload.yaml

apiVersion: v1
kind: Service
metadata:
  name: morepayload
  namespace: default
spec:
  externalName: morepayload.playwithtc.com
  type: ExternalName

File - nginx-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
  3. Verify that the ExternalName services got created and check the CoreDNS version
$ kubectl get svc
NAME          TYPE           CLUSTER-IP   EXTERNAL-IP                  PORT(S)   AGE
kubernetes    ClusterIP      10.100.0.1   <none>                       443/TCP   36m
lesspayload   ExternalName   <none>       lesspayload.playwithtc.com   <none>    17m
morepayload   ExternalName   <none>       morepayload.playwithtc.com   <none>    17m
$ kubectl logs deployment/coredns -n kube-system
Found 2 pods, using pod/coredns-85cc4f6d5-zs64x
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.8.3
linux/amd64, go1.13.15, 4293992b
  4. Exec into the nginx pod and try curl with verbose output

For - lesspayload.default.svc.cluster.local.

root@nginx-deployment-66b6959498-rxh4z:/#
root@nginx-deployment-66b6959498-rxh4z:/# curl -v lesspayload.default.svc.cluster.local.

//truncated
- Expire in 1 ms for 1 (transfer 0x5587a2e58fb0)
- Expire in 1 ms for 1 (transfer 0x5587a2e58fb0)
- Trying 10.0.5.24...
- TCP_NODELAY set
- Expire in 149998 ms for 3 (transfer 0x5587a2e58fb0)
- E

^C

For - morepayload.default.svc.cluster.local.

root@nginx-deployment-66b6959498-rxh4z:/# curl -v morepayload.default.svc.cluster.local.

//truncated
- Expire in 2 ms for 1 (transfer 0x557b79cdafb0)
- Expire in 1 ms for 1 (transfer 0x557b79cdafb0)
- Expire in 1 ms for 1 (transfer 0x557b79cdafb0)
- Could not resolve host: morepayload.default.svc.cluster.local.
- Expire in 2 ms for 1 (transfer 0x557b79cdafb0)
- Closing connection 0
  curl: (6) Could not resolve host: morepayload.default.svc.cluster.local
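
The same comparison can be made more directly with dig, which prints the header flags (sketch; dig is not included in nginx:latest, and 10.100.0.10 is the usual EKS cluster DNS service IP, confirmable with "kubectl get svc kube-dns -n kube-system"):

  dig +noedns lesspayload.default.svc.cluster.local @10.100.0.10           # CNAME plus A records returned
  dig +noedns +ignore morepayload.default.svc.cluster.local @10.100.0.10   # only the CNAME, and no tc flag in the header
  dig +noedns +tcp morepayload.default.svc.cluster.local @10.100.0.10      # full answer over TCP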

Attachment: https://github.com/TheRealGoku/coredns-tc-behavior

  1. coredns-behavior-non_working.pcap - This capture was taken in parallel on the client pod side while the curl requests for “lesspayload.default.svc.cluster.local.” and “morepayload.default.svc.cluster.local.” were made

  2. coredns-behavior-working.pcap - This capture was taken in parallel on the client pod side while the curl requests for “lesspayload.playwithtc.com” and “morepayload.playwithtc.com” were made
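
For reference, captures like these can be taken from inside the client pod while the curl commands run (sketch; tcpdump is not in nginx:latest and would need to be installed first):

  tcpdump -i any -w /tmp/coredns-behavior.pcap port 53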

As per the above two pcaps, it is evident that only in the ExternalName scenario with a large payload does CoreDNS fail to set the TC flag, sending just the CNAME in the response without any IP addresses.

In the other scenario (coredns-behavior-working.pcap), when querying the external domain with the large payload directly, CoreDNS sets the TC flag as expected, and the client therefore retries the query over TCP.