kubernetes: kube-dns: dnsmasq intermittent connection refused

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T10:00:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T09:42:05Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): PRETTY_NAME="Container Linux by CoreOS 1339.0.0 (Ladybug)"
  • Kernel (e.g. uname -a): 4.10.1-coreos
  • Install tools: custom ansible
  • Others: kube dns related images. gcr.io/google_containers/kubedns-amd64:1.9 and gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1

What happened: Intermittent DNS lookup failures from pods, e.g. java.net.UnknownHostException: dynamodb.us-east-1.amazonaws.com

What you expected to happen: Receive a response to the name lookup request.

How to reproduce it (as minimally and precisely as possible): This is the kicker: we are not able to reproduce this issue on purpose. However, we experience it in our production cluster 1-500 times a week.

Anything else we need to know: In the past two months or so we have experienced a handful of events where DNS was failing for most or all of our production pods, with each event lasting 5-10 minutes. During those events the kube-dns service was healthy, with 3-6 available endpoints at all times. We increased our kube-dns pod count to 20 in our 20-node production clusters. That level of provisioning alleviated the DNS issues that were taking down our production services, but we still experience at least weekly smaller events, ranging from 1 second to 30 seconds, which affect a small subset of pods. During these events 1-5 pods on different nodes across the cluster see a burst of DNS failures, which has a much smaller end-user impact.

We enabled query logging in dnsmasq because we were not sure whether the queries were making it from the client pod to one of the kube-dns pods. What was interesting is that during the DNS events where query logging was enabled, none of the name lookup requests that resulted in an exception were received by dnsmasq. At that point my colleague noticed these errors coming from dnsmasq-metrics:

ERROR: logging before flag.Parse: W0517 03:19:50.139060 1 server.go:53] Error getting metrics from dnsmasq: read udp 127.0.0.1:36181->127.0.0.1:53: i/o timeout

As near as I can tell, that error means dnsmasq-metrics timed out on a lookup of its own: it queries the dnsmasq container in the same pod to read dnsmasq's internal metrics, similar to running dig +short chaos txt cachesize.bind.
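
For reference, dnsmasq exposes these internal statistics as TXT records in the CHAOS class, which is what the dig command above queries; you can run the same kind of probe by hand from inside the kube-dns pod (the @127.0.0.1 target and the extra metric names are my assumptions about what the sidecar polls):

# Query dnsmasq's built-in statistics over DNS, the same mechanism dnsmasq-metrics uses
dig +short chaos txt cachesize.bind @127.0.0.1
dig +short chaos txt hits.bind @127.0.0.1
dig +short chaos txt misses.bind @127.0.0.1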

All of our DNS events happen at the exact same time that one or more dnsmasq-metrics containers are throwing those errors. We thought we might be exceeding dnsmasq's default limit of 150 concurrent queries, but we do not see any logs indicating that. If we did, we would expect to see this log message:

dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
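
If that limit were being hit, one mitigation would be to raise it via dnsmasq's dns-forward-max option. A hedged sketch (the pod name is a placeholder, and this assumes the container is named dnsmasq as in the stock kube-dns manifests):

# Check whether dnsmasq ever logged hitting its concurrent-query ceiling
kubectl -n kube-system logs <kube-dns-pod> -c dnsmasq | grep "Maximum number of concurrent DNS queries"

# The ceiling is controlled by dnsmasq's --dns-forward-max flag (default 150);
# raising it means adding e.g. --dns-forward-max=500 to the dnsmasq container args.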

Based on conversations with other cluster operators and users in Slack, I know that other users are experiencing these same problems. I'm hoping that this issue can be used to centralize our efforts and determine whether dnsmasq refusing connections is the problem or a symptom of something else.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 21
  • Comments: 103 (50 by maintainers)

Most upvoted comments

This comment explains the root cause pretty well: https://github.com/weaveworks/weave/issues/3287#issuecomment-387178077

We have switched our resolvers to TCP and have not seen these issues since. This is probably better than the 4 ms artificial delay to avoid the race that was suggested in the weave issue, and it is much easier to implement.

The title of this issue should be updated, it doesn’t only affect kube-dns.

I’ve been debugging intermittent DNS errors in a 1.7.2 cluster with Ubuntu 16 nodes on AWS deployed by kops 1.7.x. I manually cut down kube-dns (1.14.5) to just a single running replica so I could watch that EC2 node and capture DNS traffic for analysis. Notes:

  • I have not been able to identify any conntrack issues on the node hosting kube-dns (a sketch of these checks follows this list). The DNS stream at the cbr0 interface was about 130 pps, which is way below AWS's DNS quota of 1024 pps.
  • I've correlated DNS lookup failure logs from our apps with missing kube-dns replies. E.g. a pod makes a request to kube-dns, kube-dns turns around and successfully queries the AWS DNS servers, but then fails to send a reply to the pod (or it responds to only the A or the AAAA request from the pod, not both). Note that this pattern of DNS requests without matching responses shows up often in traffic captures, a lot more often than the DNS lookup failures our apps report.
  • I have not been able to affect the error rate by varying CPU load, memory pressure, etc.
  • Using iperf2 to pump traffic between pods in the cluster showed clean TCP flows (logical) but showed UDP packet loss in 200-300 pps streams (about 1 out of every few tens of thousands of packets).
  • I could not reproduce intra-cluster connection issues (e.g. a handful of curl client pods continuously sending HTTP GETs to NGINX server pods behind a service). Hundreds of thousands of requests completed without error. Again, this makes sense as TCP would mask packet loss.
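
For reference, a sketch of those node-level checks (the interface name and exact commands are assumptions, not necessarily what was run here):

# Capture DNS traffic on the node hosting kube-dns (kubenet's bridge is typically cbr0)
tcpdump -i cbr0 -w /tmp/kube-dns.pcap port 53

# Check conntrack table usage against its limit, and look for insert failures or drops
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
conntrack -S | grep -E 'insert_failed|drop'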

Regarding ndots: you can add a dot to the end of your domain name; that way it is treated as an FQDN and the local search list is never applied, e.g. arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com. (note the trailing dot).
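
A quick way to see the difference from inside a pod (getent uses the glibc resolver, so it honours the resolv.conf search list and ndots, unlike dig's defaults):

# Without the trailing dot the glibc resolver may expand the name with the search list;
# with the trailing dot the name is absolute and is sent upstream as-is.
getent hosts arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com
getent hosts arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com.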

We see the same intermittent DNS resolution issues in all clusters. In our case it’s a Python application and we are failing to resolve external domains. It isn’t related to kube-dns autoscaling events because we are running a ridiculously high but fixed number of kube-dns pods. We are also not hitting conntrack limits.

Kubernetes 1.9 on AWS, networking is kubenet; same results with kube-dns:1.14.9 and kube-dns:1.14.5.

I observed similar symptoms today.

  1. Application level DNS errors (Java again, so UnknownHostException).
  2. Errors started at the exact same time as one of the kube-dns pods restarted for unknown reasons (exit code 1, no hints in the logs).
  3. Errors then continued for days, i.e. this was not just a transient error.
  4. I took a look around and the iptables rules seemed correct
  5. I terminated the kube-dns pod which had a restart count != 0 (the one which had restarted a few days previously). Errors stopped immediately. kube-proxies each logged “deleting connection tracking state for service IP 100.64.0.10, endpoint IP 100.96.3.84” (as expected)
  6. I terminated another kube-dns pod, to force another pod restart, there were still no additional DNS errors.

This is running with flannel, k8s 1.7.6

I think what’s odd is that this continued for days. I have three working theories:

  • Bad node (but it seems OK otherwise)
  • Pod itself keeps state across relaunches (seems unlikely)
  • Nodes / iptables do not recover from the connection table flush, so subsequent connections to the same IP fail

One more data point: I was able to simulate an exit by kubectl exec-ing into the dnsmasq container and running kill 1. This did cause the pod to restart and kube-proxy logged deleting connection tracking state..., but no application-level errors were visible. To me that points away from the connection flush and towards a bad node or pod, but I'm just guessing at this point.
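
For reference, the simulation was along these lines; the pod names are placeholders and the container name assumes the stock kube-dns manifests:

# Kill PID 1 in the dnsmasq container to force a restart
kubectl -n kube-system exec <kube-dns-pod> -c dnsmasq -- kill 1

# Watch kube-proxy clear the conntrack state for the DNS service IP
kubectl -n kube-system logs <kube-proxy-pod> | grep "connection tracking"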

Next time this happens my plan is:

  • capture DNS traffic on the node with the bad kube-dns pod.
  • maybe try setting kube-dns to restartPolicy: Never. I think this will then cause a replacement pod to be scheduled, possibly on a different node, instead of restarting in place (?)

@YoniTapingo I run this little script from my container entrypoint:

#!/usr/bin/env sh

# Append "options use-vc" so glibc performs DNS lookups over TCP instead of UDP.
echo >> /etc/resolv.conf
echo "options use-vc" >> /etc/resolv.conf

You could also do it in a postStart lifecycle hook if you have root or sudo.
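
A sketch of the postStart variant; the deployment name my-app is hypothetical, and the container still needs permission to write /etc/resolv.conf:

# Add a postStart hook that appends "options use-vc" to resolv.conf when the container starts
kubectl patch deployment my-app --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/0/lifecycle",
   "value": {"postStart": {"exec": {"command": ["sh", "-c", "echo \"options use-vc\" >> /etc/resolv.conf"]}}}}
]'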

@joekohlsdorf that’s a quick win!! thank you for the tip.

Sorry for the noise, I sent the comment a few times by accident while editing it.

Hello @ApsOps, I'm a coworker of @joanfont. What you mention is true, but it should not be necessary to change it; there is indeed a problem with how Kubernetes is handling DNS resolution. In the attached pcap file you can see how, for the same DNS name, Kubernetes sometimes makes the correct query and other times decides to use search domains.

In the pcap you can see the following with more detail, but here is a short version:

Pod asks for arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com

At the host level we see these queries going to the AWS DNS servers (10.0.0.2):

07:04:21.227278000  A? arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com -> resolves OK
07:04:17.630330000  A? arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com.svc.cluster.local -> fails

The pod is always performing the same query, and dnsmasq sometimes does the right thing (forwarding the query "as is") and other times decides to apply search domains to it. I think the behaviour of ndots is consistent and does not explain this problem.
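
For anyone reproducing this from inside a pod, the two query forms from the capture can be issued by hand (dig sends names verbatim by default, so no search-list expansion gets in the way):

# The verbatim query (trailing dot makes the name absolute), which resolves
dig +short arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com.

# The search-expanded form seen intermittently in the capture, which fails
dig +short arale-ng.cyw3ljy98zq7.eu-west-1.rds.amazonaws.com.svc.cluster.local.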

I have a similar problem. My application throws an exception because it cannot resolve the database (Amazon RDS) host. Analyzing the DNS traffic produced at the time the exception is thrown, I can see that the previous DNS query resolves correctly to the RDS host (internal IP), but the following queries do not resolve correctly.

The resolver should first try to resolve the hostname as-is and, only if that query fails, apply the search domains. What happens here is that the initial query using only the hostname is never performed; the first query attempted already uses the search domains.

I’ve attached the filtered pcap where you can see the first query that is performed correctly and then the failed queries.

dns_filtered.pcap.zip

Regarding scalability, pod updates include the entire pod spec plus status. Endpoints change less frequently, only when the IPs of the Pods selected by services change.

TCP has the advantage of knowing when a conntrack entry can be deleted (FIN, RST etc)

If you look at the commit that added the conntrack removal, it was added to solve a bug in the opposite direction: a client with constant UDP traffic from the same socket would never switch over to a live endpoint, because the conntrack entry kept being refreshed (packets would go to a blackhole). We could delay the conntrack entry removal by looking at pod state (i.e. the graceful termination period), but kube-proxy does not have information about pods (it would need to watch all pods, which is a major scalability issue).

One hack would be to introduce a small delay (say 1-5 seconds) between the iptables update and the removal of conntrack entries. Most UDP protocols would respond within that time, so existing responses could still come back. Most DNS client libraries I have looked at use a new socket for each request, so new requests made after the iptables update would not go to the removed backend.
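
For context, the conntrack removal kube-proxy performs for a UDP service is roughly equivalent to running the following on a node (service IP borrowed from an earlier comment; treat the exact flags as an approximation):

# List UDP conntrack entries whose original destination is the DNS service IP
conntrack -L -p udp --orig-dst 100.64.0.10

# Delete them, which is what forces clients onto the updated iptables rules
conntrack -D -p udp --orig-dst 100.64.0.10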

I'm back to my original thinking on this: kube-proxy should not be deleting UDP connections immediately on endpoint removal. Imagine if it did this for TCP connections; the shutdown grace period would become useless. It needs to wait some period of time to give terminating pods a chance to stop gracefully. I see at least a few scenarios:

  1. Pod is deleted, starts terminating for its grace period.
    • Connection entries should be removed only when the pod exits.
  2. Pod goes NotReady.
    • Connection entries should be removed immediately, or up to some “connection entry grace period”.
  3. Endpoint is manually removed.
    • Same as NotReady scenario.

So an idea would be to detect (1) somehow, and then delay the connection entry deletion in that case (e.g. check whether the associated pod is in Terminating status before removing).

Replacing Kube-DNS with CoreDNS resulted in the same behaviour… Looks like the issue isn't with the DNS servers. The issue must be higher up in the Kubernetes DNS middleware.

@cmluciano we use the openjdk and the default for networkaddress.cache.ttl is 30 seconds according to https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/sun/net/InetAddressCachePolicy.java#L48. I verified by capturing traffic from a java app that is just doing a dns lookup in a loop for kinesis.us-east-1.amazonaws.com. I see requests hit the wire about every 30 seconds even though the loops are at 10 second intervals. Increasing this to 60 seconds may lighten the load on the name servers but dnsmasq is still refusing queries occasionally.
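
If you want to raise that TTL, the usual knobs are the networkaddress.cache.ttl security property or its legacy system-property equivalent; a sketch (the 60-second value and app.jar are placeholders):

# Option 1: raise the positive DNS cache TTL (seconds) via the legacy system property
java -Dsun.net.inetaddr.ttl=60 -jar app.jar

# Option 2: set networkaddress.cache.ttl=60 in the JDK's java.security file
#   (for JDK 7/8 that lives at $JAVA_HOME/jre/lib/security/java.security)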