kubernetes: DNS intermittent delays of 5s

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: DNS lookup is sometimes taking 5 seconds.

What you expected to happen: No delays in DNS.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster in AWS using kops with cni networking:
kops create cluster \
  --node-count 3 \
  --zones eu-west-1a,eu-west-1b,eu-west-1c \
  --master-zones eu-west-1a,eu-west-1b,eu-west-1c \
  --dns-zone kube.example.com \
  --node-size t2.medium \
  --master-size t2.medium \
  --topology private \
  --networking cni \
  --cloud-labels "Env=Staging" \
  ${NAME}
  2. CNI plugin:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
  3. Run this script in any pod that has curl (a conntrack check to run alongside it follows the script):
var=1
while true ; do
  res=$( { curl -o /dev/null -s -w %{time_namelookup}\\n  http://www.google.com; } 2>&1 )
  var=$((var+1))
  if [[ $res =~ ^[1-9] ]]; then
    now=$(date +"%T")
    echo "$var slow: $res $now"
    break
  fi
done
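While the loop above runs, it is worth watching the node's conntrack statistics (a minimal companion check, assuming conntrack-tools is installed on the node hosting the test pod); an insert_failed counter that grows whenever a slow lookup is reported points at the conntrack race discussed later in this thread:

# Run on the node hosting the test pod (requires conntrack-tools)
watch -n1 "conntrack -S | grep -o 'insert_failed=[0-9]*'"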

Anything else we need to know?:

  1. I am encountering this issue in both staging and production clusters, but for some reason the staging cluster sees far more 5s delays.
  2. Delays happen both for external names (google.com) and internal ones, such as service.namespace.
  3. This happens on both Kubernetes 1.6 and 1.7, but I did not encounter it on 1.5 (though the setup was a bit different - no CNI back then).
  4. I have not tested 1.7 without CNI yet.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.10", GitCommit:"bebdeb749f1fa3da9e1312c4b08e439c404b3136", GitTreeState:"clean", BuildDate:"2017-11-03T16:31:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 16.04.3 LTS"
  • Kernel (e.g. uname -a):
Linux ingress-nginx-3882489562-438sm 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Similar issues

  1. https://github.com/kubernetes/dns/issues/96 - closed but seems to be exactly the same
  2. https://github.com/kubernetes/kubernetes/issues/45976 - has some comments matching this issue, but is taking the direction of fixing kube-dns up/down scaling problem, and is not about the intermittent failures.

/sig network

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 89
  • Comments: 256 (111 by maintainers)

Most upvoted comments

In my tests, adding the following option to /etc/resolv.conf fixed the problem:

options single-request-reopen

But I can't find a "clean" way to set it on pods in Kubernetes 1.8. What I do:

        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c 
              - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"

@mikksoone Could you check whether it solves your problem too?
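For reference, on clusters where the PodDNSConfig feature is available (beta since Kubernetes 1.10, so not an option on the 1.8 cluster above), the same resolver option can be set declaratively instead of through a postStart hook. A minimal sketch with placeholder pod and image names:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dns-options-demo           # placeholder name
spec:
  dnsConfig:
    options:
    - name: single-request-reopen  # appended to the pod's /etc/resolv.conf
  containers:
  - name: app
    image: busybox:1.28
    command: ["sleep", "3600"]
EOF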

ld-musl-x86_64.so.1-fix-musl-1.1.18-r3-alpine3.7.tar.gz

This bug describes a few different kinds of slow DNS queries. In my case the slow requests occurred on alpine:3.7/3.8, caused by the AAAA query that musl sends alongside the A query by default timing out. We currently have no need to query AAAA records by default, so my solution was to remove AAAA resolution from the implementation (musl/src/network/lookup_name.c, function 'name_from_dns'). The following code change worked for me:

    //musl/src/network/lookup_name.c, function 'name_from_dns'
    static const struct { int af; int rr; } afrr[2] = { 
        { .af = AF_INET6, .rr = RR_A },
        { .af = AF_INET, .rr = RR_AAAA },
    };  

    for (i=0; i<2; i++) {
        if (family != afrr[i].af) {
            qlens[nq] = __res_mkquery(0, name, 1, afrr[i].rr,
                0, 0, 0, qbuf[nq], sizeof *qbuf);
            if (qlens[nq] == -1) 
                return EAI_NONAME;
            nq++;
        }   

        // hack: if the AF_UNSPEC family was requested, only send the IPv4 (A) query
        if (family == AF_UNSPEC) break;
    }

Without the fix, the lookup took around 5 seconds:

$ time nslookup wx.qlogo.cn
nslookup: can't resolve '(null)': Name does not resolve

Name:      wx.qlogo.cn
Address 1: 180.97.8.101
Address 2: 101.227.160.54
Address 3: 61.151.186.31 31.186.151.61.dial.xw.sh.dynamic.163data.com.cn
Address 4: 180.163.21.155
Address 5: 180.163.26.115
Address 6: 180.97.8.25
Address 7: 180.163.25.31
Address 8: 180.163.21.101
Address 9: 180.163.26.112
Address 10: 180.97.8.36
Address 11: 101.226.90.164
Address 12: 180.163.26.111
Address 13: 101.226.233.167
Address 14: 61.151.168.149 149.168.151.61.dial.xw.sh.dynamic.163data.com.cn
real    0m 5.26s
user    0m 0.00s
sys     0m 0.00s

After replacing /lib/ld-musl-x86_64.so.1 in the Alpine Linux Docker image with the patched one (attached to this comment), the slow DNS queries went away:

$ time nslookup wx.qlogo.cn
nslookup: can't resolve '(null)': Name does not resolve

Name:      wx.qlogo.cn
Address 1: 61.151.168.149 149.168.151.61.dial.xw.sh.dynamic.163data.com.cn
Address 2: 101.226.233.167
Address 3: 180.163.26.111
Address 4: 101.226.90.164
Address 5: 180.97.8.36
Address 6: 180.163.26.112
Address 7: 180.163.21.101
Address 8: 180.163.25.31
Address 9: 180.97.8.25
Address 10: 180.163.26.115
Address 11: 180.163.21.155
Address 12: 61.151.186.31 31.186.151.61.dial.xw.sh.dynamic.163data.com.cn
Address 13: 101.227.160.54
Address 14: 180.97.8.101
real    0m 0.31s
user    0m 0.00s
sys     0m 0.00s

My Node.js app's response times also decreased on the patched Alpine.

I pushed Docker images with this fix to Docker Hub (https://hub.docker.com/r/geekidea/alpine-a/); you can pull them as below:

docker pull geekidea/alpine-a:3.7
docker pull geekidea/alpine-a:3.8
docker pull geekidea/alpine-a:3.9
docker pull geekidea/alpine-a:3.10


Note

  1. Please be aware that this code fix changes the default DNS query behaviour on Alpine Linux.
  2. The nslookup shipped with Alpine (BusyBox) also performs the PTR requests sequentially, after finishing the A/AAAA queries.
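A quick way to sanity-check the patched images above (image tags as published in this comment; exact timings will of course vary):

docker run --rm geekidea/alpine-a:3.8 sh -c 'time nslookup google.com'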

Just to update, I’ve submitted two patches to fix the conntrack races in the kernel - http://patchwork.ozlabs.org/patch/937963/ (accepted) and http://patchwork.ozlabs.org/patch/952939/ (waiting for a review).

If both are accepted, then the timeout cases due to the races will be eliminated for those who run only one instance of a DNS server, and for others - the timeout hit rate should decrease.

Completely eliminating the timeouts when |DNS servers| > 1 is a non-trivial task and is still WIP.

Doesn’t solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so.

Note for Go users on Alpine: the Go 1.13 DNS resolver will support use-vc (golang/go#29594) and single-request (golang/go#29661).

@brb Tested with 5.0.0-rc6 the error rate has gone down to zero!

Just in case someone got here because of dns delays, in our case it was arp table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem.
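For anyone checking whether they are in the same situation, the relevant knobs are the kernel neighbour-table thresholds (the values below are illustrative, not recommendations):

# Count the current ARP/neighbour entries on the node
arp -n | wc -l
# Inspect the garbage-collection thresholds
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# Raise the hard limit if the table is overflowing
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192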

I just posted a little write-up about our journey troubleshooting the issue, and how we worked around it in production: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/.

Alpine 3.18, which ships musl 1.2.4, seems to have finally fixed this issue:

https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

We at Pinterest are using kernel 5.0 and the default iptables setup, but are still hitting this issue pretty badly.

Here is a pcap trace that clearly shows UDP packets not being forwarded to DNS while the client side hits the 5s timeout / DNS-level retries; 10.3.253.87 and 10.3.212.90 are user pods and 10.3.23.54.domain is the DNS pod:


07:25:14.196136 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
--
07:25:14.376267 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:19.196210 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:19.376469 IP 10.3.212.90.56401 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 47
07:25:24.196365 IP 10.3.253.87.36457 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 61
07:25:24.383758 IP 10.3.212.90.45350 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 39
07:25:26.795923 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345 > 10.3.23.54.domain: 8166+ [1au] A? kubernetes.default. (47)
07:25:26.797035 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.37345: 8166 NXDomain 0/0/1 (47)
07:25:29.203369 IP 10.3.253.87.57701 > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.30001: UDP, length 60
07:25:29.203408 IP node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701 > 10.3.23.54.domain: 52793+ [1au] A? mavenrepo-external.pinadmin.com. (60)
07:25:29.204446 IP 10.3.23.54.domain > node-cmp-test-kubecluster-2-0a03f89c.ec2.pin220.com.57701: 52793* 10/0/1 CNAME pinrepo-external.pinadmin.com., CNAME internal-vpc-pinrepo-pinadmin-internal-1188100222.us-east-1.elb.amazonaws.com., A 10.1.228.192, A 10.1.225.57, A 10.1.225.205, A 10.1.229.86, A 10.1.227.245, A 10.1.224.120, A 10.1.228.228, A 10.1.229.49 (998)

I ran conntrack -S and there are no insertion failures, which indicates that races 1 and 2 mentioned in this blog are already fixed and that we are hitting race 3.

# conntrack -S
cpu=0   	found=175 invalid=0 ignore=83983 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=201
cpu=1   	found=168 invalid=0 ignore=79659 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=299
cpu=2   	found=173 invalid=0 ignore=77880 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4184
cpu=3   	found=161 invalid=0 ignore=78778 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=216
cpu=4   	found=157 invalid=0 ignore=80478 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4245
cpu=5   	found=172 invalid=10 ignore=85572 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=346
cpu=6   	found=146 invalid=0 ignore=85334 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4271
cpu=7   	found=162 invalid=0 ignore=84865 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=230
cpu=8   	found=155 invalid=0 ignore=81691 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=259
cpu=9   	found=164 invalid=1 ignore=81550 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=10  	found=180 invalid=0 ignore=92864 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=270
cpu=11  	found=163 invalid=0 ignore=93113 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=238
cpu=12  	found=171 invalid=0 ignore=80868 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1934
cpu=13  	found=176 invalid=0 ignore=80974 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=2532
cpu=14  	found=174 invalid=0 ignore=91001 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=927
cpu=15  	found=175 invalid=0 ignore=79837 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=585
cpu=16  	found=168 invalid=0 ignore=84899 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=375
cpu=17  	found=172 invalid=0 ignore=84396 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=328
cpu=18  	found=142 invalid=0 ignore=80365 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=1012
cpu=19  	found=163 invalid=0 ignore=80193 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=5308
cpu=20  	found=179 invalid=0 ignore=84980 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=565
cpu=21  	found=200 invalid=0 ignore=80537 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=278
cpu=22  	found=153 invalid=0 ignore=83528 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=430
cpu=23  	found=166 invalid=0 ignore=84160 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=256
cpu=24  	found=189 invalid=0 ignore=81400 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=355
cpu=25  	found=183 invalid=0 ignore=82727 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=352
cpu=26  	found=170 invalid=0 ignore=89293 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=275
cpu=27  	found=183 invalid=1 ignore=82717 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=332
cpu=28  	found=188 invalid=0 ignore=83741 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=242
cpu=29  	found=192 invalid=0 ignore=88601 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=396
cpu=30  	found=166 invalid=0 ignore=84152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=329
cpu=31  	found=165 invalid=0 ignore=81369 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=229
cpu=32  	found=170 invalid=0 ignore=84275 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4208
cpu=33  	found=160 invalid=0 ignore=86734 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4122
cpu=34  	found=173 invalid=0 ignore=82152 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4231
cpu=35  	found=150 invalid=0 ignore=78019 insert=0 insert_failed=0 drop=0 early_drop=0 error=0 search_restart=4130

There are a few more action items we are trying ATM:

  1. a newer kernel version
  2. a single DNS replica (occupying a full node so it has a rather stable runtime), so there is only one rule in iptables
  3. TCP for DNS.

I will keep you posted, but if anyone has already tried any of these three options and has a failure or success story to share, it would be much appreciated.

/cc @thockin @brb

The second kernel patch to mitigate the problem got accepted (context: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts) and it is out in Linux 5.0-rc6.

Please test it and report whether it has reduced the timeout hit rate. Thanks.

We wrote a blog post describing the technical details of the problem and presenting the kernel fixes: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.

I've been having this issue for some time on Kubernetes 1.7 and 1.8 - DNS queries were being dropped from time to time. Yesterday I upgraded my cluster from 1.8.10 to 1.9.6 (kops from 1.8 to 1.9.0-alpha.3) and started hitting this same issue ALL THE TIME. The workaround suggested in this issue has no effect and I can't find any way of stopping it. As a small workaround I've pinned the most requested (and problematic) DNS names to fixed IPs in /etc/hosts. Any idea where the real problem is? I'll test with a brand new cluster on the same versions and report back.

Same here: small clusters, no ARP or QPS limits. dnsPolicy: Default works without delays, but unfortunately it cannot be used for all deployments.

I think we found another fix for this - use TCP for DNS requests in a container by adding the following to the spec:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      dnsConfig:
        options:
        - name: use-vc  # tells the local DNS resolver to use TCP instead of UDP; UDP is flaky in containers
      containers:
...

@axot @chrisghill we use CoreDNS as a DaemonSet with dnsmasq in front. We also use a conntrack bypass and keep only the DNS infrastructure on the same node. The postmortem of the incident that triggered the change from our dnsmasq + tc-flannel and CoreDNS backend via service IP can be found at https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md. Our setup can be found at https://github.com/zalando-incubator/kubernetes-on-aws/tree/dev/cluster/manifests/coredns-local. We boot the nodes with a conntrack bypass (https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/node-pools/worker-default/userdata.clc.yaml#L115) for the DNS ports on the node IP (https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/node-pools/worker-default/userdata.clc.yaml#L217), set via an environment variable (https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/node-pools/worker-default/userdata.clc.yaml#L154). I hope this helps you build a solid DNS infrastructure in Kubernetes.

NodeLocalDNS uses TCP connections from the local DNS servers to the cluster DNS service IP, which are more robust than UDP: a single packet drop kills a UDP transaction, while TCP recovers comparatively quickly via retransmission.

Anyway, to completely eliminate the DNS timeout issue, use Linux kernel v5.0 (the final release is probably next week) and Cilium for k8s networking. The latter replaces kube-proxy for accessing services with its own LB implementation (BPF based) which selects an endpoint based on a packet hash. So, in the case of two racing DNS requests, the same endpoint is selected, and thus neither packet is dropped => no DNS timeouts.

@thockin @bowei

Requesting your feedback, therefore tagging you.

Could this be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02?

There are multiple reports of this issue in the Kubernetes project, and it would be great to have it resolved for everyone.

We are facing the same issue. Applying the single-request-reopen parameter to our pods’ resolv.conf “fixes” the issue, but there is one other piece of information I’d like to add.

We noticed that if we change the DNS address in one of our pods’ resolv.conf to one of our core-dns pods’ address, everything works fine, no timeouts. But when we go through the default configuration, which is the core-dns service’s address, we get the intermittent 5 seconds delay.

Since the single-request-reopen parameter controls whether one socket is reused for more than one DNS request, it may be that the Kubernetes Service implementation somehow gets confused by receiving more than one request through the same socket.
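A hypothetical way to reproduce that comparison from inside an affected pod (the Service VIP and pod IP below are placeholders; look yours up with kubectl -n kube-system get svc,pods -o wide):

# 1) Through the kube-dns/CoreDNS Service VIP (DNAT and conntrack involved)
time dig +tries=1 +time=3 kubernetes.default.svc.cluster.local @10.96.0.10
# 2) Directly against a single CoreDNS pod IP (no Service DNAT)
time dig +tries=1 +time=3 kubernetes.default.svc.cluster.local @10.244.1.17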

@george-angel> […] your dns deployment being no longer highly available

Correct, this was a design decision and discussed extensively as part of the KEP process: see https://github.com/kubernetes/enhancements/pull/1005#pullrequestreview-231381118 and two other PRs linked earlier.

One of the initial ideas which I really liked was to shadow a ClusterIP of the in-cluster kube-dns Deployment for node-local-dns (https://github.com/kubernetes/community/pull/2842#discussion-diff-227648753R91) – this was not added to the spec to avoid tying it too much to the current implementation of how Services work, but I'd assume this scenario would still work (with kube-proxy in iptables mode only).

Still, node-local-dns is the recommended solution and is now in beta. If necessary, HA can be added in some way or another.

We (uSwitch) run it in production across multi-AZ clusters. It’s in non-HA setup (we decided to only consider adding HA if/once it starts causing any stability issues). We can’t be happier with it as our DNS long-tail latency improved significantly and DNS became a non-issue.

I can recommend node-problem-detector to monitor node-local-dns. Specifically, we have 3 custom plugins running tests every second (one testing node-local-dns itself, another testing upstream external DNS, and the last one testing in-cluster kube-dns). This can be coupled with drano to cordon and drain nodes experiencing issues with node-local-dns. In practice we have yet to see node-local-dns break in some way; the only cases of node-local-dns flakiness so far coincided with either node loss or a (temporary) whole-kernel freeze.

edit:

@george-angel> […] If its down (upgrade […]

Good point. In our clusters we terminate nodes by age (currently the max age is set to 1 week) with the help of surtr, so our node-local-dns DaemonSet is configured with an updateStrategy of OnDelete. No more frivolous upgrades with service interruption!

FYI: single-request-reopen didn’t help in my case (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-359897058)

Kubernetes 1.10, AWS

# cat /etc/resolv.conf
nameserver 100.64.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
options single-request-reopen
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/100.64.0.10:53] query timed out after 5000 milliseconds (no stack trace available)

OS in container:

PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"

I don’t know if Netty or OpenJDK support this option

I am by the way experiencing the same thing. With Kubernetes 1.10+CoreOS+Weave+CoreDNS/kube-dns, I see constant 5s latency on DNS resolution. tcpdump shows that the first AAAA requests get lost somehow: https://hastebin.com/banulayire.swift. With single-request or single-request-reopen, the issue is gone.

https://github.com/kubernetes/kubernetes/issues/62628

use-vc didn’t work for us (AKS). The queries were consistent, but they all took about 8.5 seconds. However single-request-reopen worked.

Try adding

options use-vc

to your resolv.conf. It will force TCP for DNS lookups and works around this issue with ease.
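One way to confirm that lookups really switch to TCP once the option is in effect (a sketch; run it on the node or inside the pod's network namespace):

# TCP DNS traffic should start showing up here once use-vc is active,
# while UDP port 53 traffic from the pod should stop appearing
tcpdump -ni any 'tcp port 53'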

At this point, running kube-dns or dnsmasq on every node becomes very attractive.

Various issues discuss this - #45363 for instance.

Each workaround or fix doesn't work for every use case, so users, sysadmins, and app developers should consider which one fits their situation, along with the costs and risks of patching tc/iptables scripts, updating to a new kernel, or patching musl libc.

In my use case, a Node.js app's responses timed out for 5+ seconds when requesting some URIs (like http://wx.qlogo.cn/**) on Alpine Linux Docker, so I coded the fix (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-409603030) in musl libc; for me that carries less cost and risk than patching tc scripts/iptables rules or upgrading the hosts' kernel. I also don't think tc can work very well for nodes hosting 1,000k+ live TCP connections (or conntrack entries).

We faced the same issue on a small self-managed cluster. The problem was solved by scaling the CoreDNS deployment down to a single pod.
This is a strange and unexpected solution, but it has solved the problem for us.

Cluster info:

nodes arch/OS:  amd64/debian
master nodes:   1
worker nodes:   6
deployments:    100
pods:           150
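For reference, this is roughly what that scale-down amounts to (assuming the standard CoreDNS install in kube-system; note that a single replica trades away DNS redundancy):

# Deployment and label names assume the default kube-system CoreDNS install
kubectl -n kube-system scale deployment coredns --replicas=1
kubectl -n kube-system get pods -l k8s-app=kube-dns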

@zhan849 Alternatively, you could use Cilium's kube-proxy implementation, which does not suffer from the conntrack races (it does not use netfilter/iptables).

https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/

– Quentin Machu

On March 5, 2019 at 01:35:02, harperwang (notifications@github.com) wrote:

@inter169 https://github.com/inter169 Already tested 3.7, 3.8, 3.9 same dns behavior on Kubernetes 13.3, weave 2.5.0 and hyper-v, it means they did not include concurrent A and AAAA queries sent through one socket and everybody is looking to fix whole cluster just for image fix. @jcperezamin https://github.com/jcperezamin my fix (#56903 (comment) https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-409603030 , docker hub: geekidea/alpine) was removing the concurrent AAAA query by default, it maybe break the modern DNS flavor to be adapt to ipv4/v6, and that’s the reason why I didn’t contact to alpine community.


+1 I would like to understand more as well. It’s really crazy that an issue so fundamental exists for kubernetes.

@szuecs is there any configuration you are aware of that would actually eliminate this issue? Currently the only solution I know of is switching to TCP for DNS (use-vc), but that's not supported by all distros. I also assume that using Calico or VPC routing on AWS would bypass it, because all pods get routable IP addresses and thus NAT is never needed.

Do you think that running CoreDNS on each node with hostNetwork: true would work?

See https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/ for a more in-depth description of the issue on Kubernetes and a workaround that doesn’t involve rewriting musl or glibc.

@KIVagant

This is a libc option; it generally depends not on the application but on the libc used - unless the application has its own implementation of the resolver stack (which I don't know for OpenJDK/Netty).

Please however confirm that the issue actually comes from the conntrack race condition we are talking about, or something else. To confirm, use watch -n1 conntrack -S and check whether the insert_failed column is increasing as you are getting time-outs.

My tc-based workaround should fix your issue if that’s the case.

Sandor Szücs, good question! Proxying DNS in general is an interesting direction worth exploring; however, we run Kubernetes in a highly customized way and at a fairly large scale, so it would be hard to plug and play most open-source solutions. Also, as @rata said, it does not solve the root cause. The 3 possible alleviations I posted above fit our current production setup better 😃

Nodelocal DNSCache uses TCP for all upstream DNS queries (alleviation 3 mentioned in your previous comment), in addition to skipping connection tracking for client-pod-to-nodelocal-DNS requests. It can be configured so that client pods continue to use the same DNS server IP, so the only change would be to deploy the DaemonSet. There are at least a couple of comments in this issue about clusters seeing significant improvement in DNS reliability and performance after deploying nodelocal dnscache. Hope that feedback helps.

@szuecs please note that there is upstream support for that, stable in 1.18 (I haven't tried it myself): https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

I guess this does not solve the root cause of the ndots problem, although it is probably amortized. I guess it will just address the conntrack races, which are the issue causing the intermittent delays.
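A rough sketch of the documented installation flow for the node-local cache (placeholder values; the __PILLAR__ variable names follow the upstream manifest linked above and may change between releases):

# Values used for the substitution (examples)
kubedns=$(kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}')
domain=cluster.local
localdns=169.254.20.10

# Fetch the canonical manifest and fill in the placeholders
wget -O nodelocaldns.yaml https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml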

@zhan849 why not use a DaemonSet and bypass conntrack? We run dnsmasq in front of CoreDNS in a DaemonSet pod and use cloud-init and some systemd units to point resolv.conf (via kubelet) at the node-local dnsmasq running with hostNetwork. It works great even when we spike DNS traffic enough to OOM-kill CoreDNS - this happened in a Node.js-heavy cluster. https://github.com/zalando-incubator/kubernetes-on-aws/tree/dev/cluster/manifests/coredns-local Everything else is in systemd units I can't share.

@dbyron0: any chance of making the source for your node-problem-detector plugins available?

Here you go: uswitch/node-problem-detector#4 (specifically https://github.com/uswitch/node-problem-detector/pull/4/commits/e124ea7241a671978551dd13f6b3ffaf87f96a80).

You might want to parameterize and change bits and pieces there, as it is tailored to a specific setup. Log an issue in the repo or hit me up on k8s Slack if you have any questions, so that we can keep this comment thread on topic.

@realdimas any chance of making the source for your node-problem-detector plugins available? They sound super useful. Thanks much for all the info.

Folks, I'd like to reiterate in this comment thread what Tim Hockin, Pavithra Ramesh and others have already mentioned: the node-local DNS cache daemon is considered a solution to multiple causes of the DNS long-tail latency/timeout issue.

Please have a look at the following:

The source code of the implementation of the node-local DNS cache can be found at https://github.com/kubernetes/dns/tree/898b99f8a72a547329ea7e4b28f63bc79375cac2/cmd/node-cache. It is essentially a minimalist CoreDNS caching daemon bound to a static non-routable IP address, with an embedded wrapper that takes care of setting up the dummy network interface and exempting its flows from connection tracking.

Canonical manifests for installation of the DaemonSet: https://github.com/kubernetes/kubernetes/tree/0216ccf80a604b15bb19752dccd23ac2e62f1e10/cluster/addons/dns/nodelocaldns

Once installed, configured and verified, all you have to do is re-point the kubelet's ClusterDNS to it.

While this feature officially went into alpha only in v1.13 (and graduated to beta in Kubernetes v1.15), it does not require you to run a recent version. It actually works well even with Kubernetes v1.12!

Not sure how it's still a question of what's technically happening: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/. Several fixes have been committed into the kernel to fix the issue; kernel 5.1 has the final one. – Quentin Machu

On May 10, 2019 at 09:43:04, prameshj wrote:

prameshj> I see that glibc 2.16 introduced parallel v4 (A) and v6 (AAAA) queries (https://abi-laboratory.pro/?view=changelog&l=glibc&v=2.16.0). As a result of this, if AAAA gets a response first and, as Chris O'Haver (https://github.com/chrisohaver) explained, AAAA will say NOERROR but have no answer, which might make the client error out with "no such hostname"?

Thanks @Quentin-M. Based on the description given by @xeor, it looks like a different issue than the kernel race and dropped requests. That's why I was suggesting opening a new issue and discussing it there.

@xeor what you are describing doesn’t seem at all related to this issue.

From a networking perspective, the DNS query is the same whether you initiate it from dig or from curl.

If dig always works and curl only works every 5 seconds (?), this sounds like something else altogether, but maybe I misunderstand your test case.

@xeor @prameshj we are running K8s 1.13.2 with nodelocaldns on ~10 different clusters (CoreOS 1967.6.0, kernel 4.14.96) and have not had any issues with DNS timeouts since moving to this setup across the whole fleet.

The problem is that we would increase the number of DNS requests to kube-dns and the number of conntrack entries/DNATs, which might be causing the latency problem to begin with. On a cache miss, nodelocal is going to query kube-dns as well. You are right that the nodelocal connections to kube-dns will be TCP, so those would at least get an RST instead of being blackholed.

Just to be clear though, using options single-request doesn't resolve the issue; it just makes it roughly 99% less likely to happen for most people. The problem of two packets going out back to back can still occur, but since the resolver no longer sends them back to back, it's far less likely.

And as he also stated above, decreasing the timeout so that it retries more often also improves the chance that a retry won't collide with another packet and that one of them will make it through, though you will still see a delay.

@ahmadalli I was able to get Alpine to respect the timeout in /etc/resolv.conf; setting it to 1 decreased the impact of this. It still happened just as frequently, but it took my 5/10/15 second timeouts down to 1/2/3 seconds. Eventually I moved to Debian though... This is a very frustrating problem.

I don’t know too much about it, and I ran into all this after I made the Debian switch, but there may be a CNI-layer solution somewhere.

https://github.com/projectcalico/calico/issues/2073 https://github.com/weaveworks/weave/issues/3287 https://github.com/coreos/flannel/issues/1004

Just want to put down a record here: adding options single-request to resolv.conf resolved the problem for us in K8s production.

I would think the option "single-request" better fits the problem scenario than "single-request-reopen", but you can try either one and see which works for you.

@jcperezamin for awareness: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md. We now run a DaemonSet with a 2-container pod (dnsmasq + CoreDNS) and add a separate firewall rule to bypass conntrack. This works without building a new base image and rebuilding all your containers.

I think https://github.com/kubernetes/kubernetes/issues/70707 and the related issues are a better subscribe target. The feature is also already available as an alpha in 1.13+

This issue is not a Kubernetes issue, per se. We’re working on a per-node DNS cache model which will hopefully alleviate this quite a lot. I’ll close this issue, though the conversation can continue. It’s just not actionable (or rather the cache is the action)

@thockin, where can we track the per-node cache design and work?

@nitin302 as mentioned by @mikksoone, setting dnsPolicy: Default in your pod fixes the issue (for us, anyway). However, it means you won’t be able to resolve internal services by name. You’ll have to expose them in order to access the services.
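For clarity, this is roughly what that workaround looks like in a pod spec (a sketch with placeholder names; as noted, the pod then inherits the node's resolv.conf and loses cluster-internal name resolution):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: node-dns-demo        # placeholder
spec:
  dnsPolicy: Default         # use the node's resolver instead of the cluster DNS service
  containers:
  - name: app
    image: busybox:1.28
    command: ["sleep", "3600"]
EOF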

@marshallford Hey there. The most common issue is that I built the image assuming the community was running their DNS server on port 5353 (unprivileged), so the rules applied to port 5353 rather than 53. A contributor and I changed the script/Dockerfile yesterday to default to 53 instead. Tell me if that helps! You do not need to blow away any nodes, or do anything besides running the container on every node. You should then be able to see the tc rules, and conntrack's insert_failed counter should look much less insane.

https://github.com/Quentin-M/weave-tc

For what it's worth for future readers, we have hammered out a solution that seems to work really well.

Solution overview: run dnsmasq on each node; expose it to the pods via a static IP on a new local adapter, using an IP that is the same on every node; point the kubelet's --cluster-dns at this IP address.

How the solution performs. Test cluster details:

  • kubernetes 1.10.3
  • built by kops 1.10.beta-2 on aws
  • nodes are centos
  • 3-az cluster, 3 masters 6 workers
  • coredns, 4 pods ( up from the standard 2: more makes the problem worse)

We used this dns-tester to run some tests.

Using the stock arrangement, we see 1 lookup failure in every 4000 requests. This sounds like a small number, but in our testing the probability is higher for initial requests. For example, though the tester sees failures in 1/4000 requests, manually running curl fails on the DNS lookup more like one time in 50.

Using the solution described here, we see one lookup failure in approximately 1M requests. Further, lookups with curl seem to never fail. It seems to us that this arrangement actually bypasses, rather than simply masks, the UDP packet loss. We have not confirmed this with conntrack yet.

More about the solution: we run dnsmasq on the nodes (see above for examples - we actually installed it in our base image for the nodes, but a DaemonSet as suggested by @szuecs would work well if you prefer that approach).

Then, we created a loopback alias (lo:0) on the nodes with a static IP (we used 198.18.0.1):

cat <<EOF > /etc/sysconfig/network-scripts/ifcfg-lo:0
DEVICE=lo:0
BOOTPROTO=static
IPADDR=198.18.0.1
NETMASK=255.255.255.255
ONBOOT=yes
EOF

ifup lo:0

Dnsmasq listens on this ip:

cat <<EOF > /etc/dnsmasq.d/colinx
cache-size=1000
log-queries
dns-forward-max=1500
all-servers
neg-ttl=30
interface=lo
listen-address=198.18.0.1
bind-interfaces
server=/cluster.local/100.64.0.10#53
server=/in-addr.arpa/100.64.0.10#53
server=/ip6.arpa/100.64.0.10#53
EOF

Since this IP is the same on all nodes, we can simply do --cluster-dns 198.18.0.1,<kubedns>. This avoids the difficulty of setting it in kops, where we have limited control over the manifests and the nodeup cycle.
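The kubelet side of that wiring, roughly (flag form; the second address is the ordinary kube-dns Service IP from earlier in this thread, kept as a fallback, and all other kubelet flags are omitted):

kubelet --cluster-dns=198.18.0.1,100.64.0.10 --cluster-domain=cluster.local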

This solution meets all of our requirements:

  • performs well
  • doesn’t require doing anything to all pods
  • doesn’t require manifest changes that are hard to make when using kops
  • doesn’t expose a dns server on public ports
  • easy to back out if it doesn't work

To set the DNS resolver to the right EC2 node we do this:

This way every pod gets an /etc/resolv.conf specific to its node, with the local resolver as the first nameserver. The nameserver target is this dnsmasq DaemonSet: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/kube-dns/node-local-daemonset.yaml

@maxlaverse Thank you for your note about the typo, for the word usage of “fix” that I replaced with “mitigate”, and for your great initial troubleshooting work!

You’ve tested the ipvs back-end of kube-proxy and it didn’t solve the issue.

This is what we use daily, yes.

The lvs wiki states that the ipvs module uses its own connection tracking system. Since there is only network address translation to be done, and no filtering or other kind of mangling involved, I had good hope it would work better. Do you know why?

This is a great find; I was not aware of it at all. Given the wording of the documentation you provided, it sounds like one needs to unload the netfilter conntrack module explicitly to avoid double-tracking, which we haven't done. The information in the Kubernetes documentation regarding ipvs is actually confusing, as it states that nf_conntrack_ipv4 must be loaded.

Thanks for sharing this @Quentin-M.

This means that adding --random-fully does fix the packet loss, as the flag only acts on the SNAT race!

I believe there is a typo here and you meant does not fix the packet loss.

I’ll use this opportunity to clarify one point 😉

I hope the article about connection timeouts on Kubernetes was not misleading, but this flag NF_NAT_RANGE_PROTO_RANDOM_FULLY (or its user-space switch equivalent --random-fully) is not supposed to fix anything at all. It allows starting from a random number when incrementally looking for a free port available for a network translation. It mitigates the issue but doesn't solve it. The conntrack record is still created in one of the first hooks of the POSTROUTING chain and inserted into the conntrack table in one of the last, leading to a race condition.

I have one question that I think was asked a few times across the related GitHub issues on this topic, and I believe the answer could interest people here. You've tested the ipvs back-end of kube-proxy and it didn't solve the issue. Do you know why? The lvs wiki states that the ipvs module uses its own connection tracking system. Since there is only network address translation to be done, and no filtering or other kind of mangling involved, I had good hope it would work better.

Just joining some dots: https://github.com/weaveworks/weave/issues/3287 https://github.com/kubernetes/kubernetes/issues/45976

and while I’m talking about dots let me recommend using fully-qualified names where possible, e.g. google.com. - the dot at the end stops the resolver from following the search path, so you don’t get lookups for google.com.svc.cluster.local. and so on.
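To make the effect concrete (a sketch; the search domains are the ones from the resolv.conf shown earlier in this thread):

# With "options ndots:5", an unqualified name walks the search path first:
#   google.com.default.svc.cluster.local.  -> NXDOMAIN
#   google.com.svc.cluster.local.          -> NXDOMAIN
#   google.com.cluster.local.              -> NXDOMAIN
#   google.com.ec2.internal.               -> NXDOMAIN
#   google.com.                            -> answer
dig +search +ndots=5 google.com
# A trailing dot makes the name fully qualified, so only one query is sent:
dig google.com.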

Can you please share the set of scripts/commands you used to generate the above stats?

Also, does anyone have a Grafana dashboard template for kube-dns?

@bowei sadly this happens in very small clusters as well for us, ones that have so few containers that there is no feasible way we’d be hitting the QPS limit from AWS

We have the same issue within all of our kops deployed aws clusters (5). We tried moving from weave to flannel to rule out the CNI but the issue is the same. Our kube-dns pods are healthy, one on every host and they have not crashed recently.

Our ARP tables are nowhere near full (usually fewer than 100 entries).