amazon-vpc-cni-k8s: Intermittent DNS timeouts in a pod

We have a couple of jobs that run in a pod, and the very first thing each of them does is download a file from GitHub. These jobs fail intermittently, roughly once every couple of days, with a DNS resolution timeout.

Docker log:

time="2019-08-21T21:15:03Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849/shim.sock" debug=false pid=14981

CNI log:

2019-08-21T21:15:03.381Z [INFO]	AssignPodIPv4Address: Assign IP 172.22.124.20 to pod (name uu-snowflake-updater-1566422100-xmtp9, namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849)
2019-08-21T21:15:03.381Z [INFO]	Send AddNetworkReply: IPv4Addr 172.22.124.20, DeviceNumber: 0, err: <nil>
2019-08-21T21:15:03.382Z [INFO]	Received add network response for pod uu-snowflake-updater-1566422100-xmtp9 namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849: 172.22.124.20, table 0, external-SNAT: false, vpcCIDR: [172.22.0.0/16]
2019-08-21T21:15:03.410Z [INFO]	Added toContainer rule for 172.22.124.20/32 hostname:kubecd-prod-nodes-worker @timestamp:August 21st 2019, 17:15:39.000

Container log:

August 21st 2019, 17:15:03.701	  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
August 21st 2019, 17:15:03.701	                                 Dload  Upload   Total   Spent    Left  Speed
August 21st 2019, 17:15:08.771	
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0curl: (6) Could not resolve host: raw.githubusercontent.com

There is less than a 300 ms delay between the CNI finishing the iptables and veth setup and curl making its request. Is there a chance of a race condition in this scenario? Since it happens rarely and intermittently, it doesn’t seem to be a configuration issue.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 13
  • Comments: 42 (10 by maintainers)

Most upvoted comments

🤯

We plan to address the limitations on TCP DNS queries for Nitro instances, but do not have an ETA for the fix yet.

This should be very clearly documented somewhere.

@mogren, if that’s the case, then you might also want to update some documentation. Currently the EKS docs refer to the Kubernetes docs, which suggest configuring force_tcp for the node-local DNS cache.

Had the same problem. I reached out to AWS; they confirmed it and said they have no ETA for the fix. Just thought I would update the thread with the info in case someone else runs into the same problem.

This is what they suggested:

Workaround:

1. Linux users should ensure that 'options use-vc' does not appear in /etc/resolv.conf.
2. Users of CoreDNS should ensure that the 'force_tcp' option is not enabled.
3. Customers should make sure the size of DNS responses is less than 4096 bytes, so that they fit in UDP packets (a quick way to check all three is sketched below).
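If you want to verify those three points, something like the rough sketch below can help. It assumes the default coredns and node-local-dns ConfigMaps in kube-system and that dig is installed; the ConfigMap names, namespace, and the test hostname are assumptions you may need to adjust for your cluster.

# 1. use-vc in resolv.conf forces the resolver library to use TCP; it should not be present
grep use-vc /etc/resolv.conf

# 2. force_tcp should not appear in the CoreDNS or node-local-dns Corefile
kubectl -n kube-system get configmap coredns -o yaml | grep force_tcp
kubectl -n kube-system get configmap node-local-dns -o yaml | grep force_tcp

# 3. responses for the names you query should stay under 4096 bytes so they fit in UDP
dig +bufsize=4096 raw.githubusercontent.com | grep 'MSG SIZE'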

Hi @Eslamanwar! We have recently found out that using TCP for DNS lookups can cause issues. Do you mind checking that options use-vc is not set in /etc/resolv.conf and that force_tcp is not set for CoreDNS? Also, it’s best to make sure that the size of the DNS responses is less than 4096 bytes, to ensure they fit in UDP packets.

I also have a response from AWS Support team on my case:

We have identified that you were using TCP DNS connections to the VPC Resolver. We have identified the root cause as limitations in the current TCP DNS handling on EC2 Nitro Instances. The software which forwards DNS requests to our fleet for resolution is limited to 2 simultaneous TCP connections and blocks on TCP queries for each connection. Volume exceeding 2 simultaneous requests will result in increased latency. It is our recommendation that you prefer UDP DNS lookups to prevent an increase in latency. This should provide an optimal path for DNS requests that are less than 4096 bytes, and minimize the TCP latency to DNS names which exceed 4096 bytes. We plan to address the limitations on TCP DNS queries for Nitro instances, but do not have an ETA for the fix yet.

Hello, any solution for that? We have terrible issues in production because of it

We were having the same issue here: https://github.com/kubernetes/dns/issues/387. I removed the force_tcp flag from the forwarder config in node-local-dns, but I definitely still saw TCP requests and responses to the upstream AWS VPC resolver, and response times were not good: 4, 2, and 1 seconds for about 1% of requests. However, when I set prefer_udp instead, I see only UDP requests and responses, and response times are all good. We use the k8s.gcr.io/k8s-dns-node-cache:1.15.12 image.
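For reference, the forward block in a node-local-dns Corefile after that change might look roughly like the sketch below. This is only an illustration: the upstream resolver address 172.22.0.2 and the catch-all zone are placeholders, and a real node-local-dns Corefile carries additional plugins and zones.

.:53 {
    errors
    cache 30
    # prefer_udp sends queries to the upstream resolver over UDP first;
    # TCP is only used when a UDP response comes back truncated
    forward . 172.22.0.2 {
        prefer_udp
    }
}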

This just bit us. Any update on a fix? At least documenting this would be helpful.

We were facing the same issue in our clusters. We were using node-local-dns, and in its configuration we had the force_tcp flag. We were getting lots of timeouts that way. After removing the flag, the timeouts went away.

Seems like it might not be the best idea to upgrade to >1.5.1 when running EKS 1.14: https://github.com/aws/containers-roadmap/issues/489

In addition to those performance issues, the proxy plugin is deprecated in newer CoreDNS releases, so upgrading from 1.3.1 to 1.5.2 in existing clusters while keeping the same ConfigMap won’t be successful.

Could be related to https://github.com/coredns/coredns/pull/2769 if you’re using CoreDNS. Upgrading to >1.5.1 should fix it, if that’s the issue you’re facing.

You can further verify that that’s the issue if curl raw.githubusercontent.com fails intermittently, but curl --ipv4 raw.githubusercontent.com never does.
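A crude way to compare the two from inside an affected pod is to run both variants in a loop and count failures; this sketch assumes bash and curl are available in the pod image.

# any non-zero exit (e.g. curl error 6, "Could not resolve host") counts as a failure
for i in $(seq 1 200); do
  curl -s -o /dev/null https://raw.githubusercontent.com/ || echo "default lookup failed on attempt $i"
  curl -s --ipv4 -o /dev/null https://raw.githubusercontent.com/ || echo "ipv4-only lookup failed on attempt $i"
done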