amazon-vpc-cni-k8s: Intermittent DNS timeouts in a pod
We have a couple of jobs that run in a pod, and the very first thing they do is download a file from GitHub. These jobs fail intermittently, roughly once every couple of days, with a DNS resolution timeout.
Docker log:
time="2019-08-21T21:15:03Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849/shim.sock" debug=false pid=14981
CNI log:
2019-08-21T21:15:03.381Z [INFO] AssignPodIPv4Address: Assign IP 172.22.124.20 to pod (name uu-snowflake-updater-1566422100-xmtp9, namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849)
2019-08-21T21:15:03.381Z [INFO] Send AddNetworkReply: IPv4Addr 172.22.124.20, DeviceNumber: 0, err: <nil>
2019-08-21T21:15:03.382Z [INFO] Received add network response for pod uu-snowflake-updater-1566422100-xmtp9 namespace prod container 61b11a65981e7324715619bc5f9b9296e06ecea675a666f156fa98169a6a2849: 172.22.124.20, table 0, external-SNAT: false, vpcCIDR: [172.22.0.0/16]
2019-08-21T21:15:03.410Z [INFO] Added toContainer rule for 172.22.124.20/32
Container log:
August 21st 2019, 17:15:03.701 % Total % Received % Xferd Average Speed Time Time Time Current
August 21st 2019, 17:15:03.701 Dload Upload Total Spent Left Speed
August 21st 2019, 17:15:08.771
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0curl: (6) Could not resolve host: raw.githubusercontent.com
There is less than roughly 300 ms between the CNI finishing the iptables and veth setup and curl making its request. Is there a chance of a race condition in this scenario? Since it happens rarely and intermittently, it doesn't seem to be a configuration issue.
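For anyone trying to narrow this down, a rough way to separate a pod-startup race from ongoing flakiness is to resolve the same host in a loop from inside an affected pod and note when the failures happen. This is only a sketch, not part of the original report, and it assumes `getent` is available in the container image:

```sh
# Resolve the same host repeatedly and log any failures with a timestamp.
# If failures only show up in the first iteration or two, a setup race is
# more likely; if they keep happening, it points at the DNS path itself.
for i in $(seq 1 500); do
  if ! getent hosts raw.githubusercontent.com > /dev/null; then
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') lookup $i failed"
  fi
  sleep 1
done
```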
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 13
- Comments: 42 (10 by maintainers)
🤯
This should be very clearly documented somewhere.
@mogren, if that’s the case, then you might want to also update some documentation. Currently the EKS docs refer to the Kubernetes docs, which suggest configuring `force_tcp` for the node-local DNS cache.

Had the same problem; I reached out to AWS, they confirmed it and said they have no ETA for the fix. Just thought I would update the thread with the info in case someone else hits the same problem.
This is what they suggested:

Hi @Eslamanwar! We have recently found out that using TCP for DNS lookups can cause issues. Do you mind checking that `options use-vc` is not set in /etc/resolv.conf and that `force_tcp` is not set for CoreDNS? Also, it’s best to make sure that the size of the DNS responses is less than 4096 bytes, to ensure they fit in UDP packets.

I also have a response from the AWS Support team on my case:
Hello, is there any solution for this? We are having terrible issues in production because of it.
We were having the same issue here: https://github.com/kubernetes/dns/issues/387. I’ve removed the `force_tcp` flag from the forwarder config in node-local-dns, but I definitely still see TCP requests and responses to the upstream AWS VPC resolver, and the response times are not good: 4, 2, and 1 seconds for about 1% of requests. However, when I set `prefer_udp` I see only UDP requests and responses, and the response times are all good. We use the `k8s.gcr.io/k8s-dns-node-cache:1.15.12` image.

This just bit us. Any update on a fix? At least documenting this would be helpful.
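To see what the node-local-dns comment above describes, i.e. whether upstream queries still go over TCP after the config change, something like the following works as a sketch. The ConfigMap name assumes the standard node-local-dns install, and the resolver address assumes the usual VPC-base-plus-two convention (172.22.0.2 for the 172.22.0.0/16 VPC in the logs above):

```sh
# Is force_tcp or prefer_udp set in the node-local-dns Corefile?
kubectl -n kube-system get configmap node-local-dns -o yaml | grep -E 'force_tcp|prefer_udp'

# On a worker node, watch whether queries to the VPC resolver go out over UDP or TCP.
sudo tcpdump -ni any 'host 172.22.0.2 and port 53'
```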
We were facing the same issue in our clusters. We were using node-local-dns, and in its configuration we had the `force_tcp` flag. We were getting lots of timeouts that way. After removing the flag, the timeouts went away.

Seems like it might not be the best idea to upgrade to >1.5.1 when running EKS 1.14: https://github.com/aws/containers-roadmap/issues/489
Could be related to https://github.com/coredns/coredns/pull/2769 if you’re using CoreDNS. Upgrading to >1.5.1 should fix it, if that’s the issue you’re facing.
You can further verify that that’s the issue if `curl raw.githubusercontent.com` fails intermittently but `curl --ipv4 raw.githubusercontent.com` never does.
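A small loop along those lines, as a sketch; it also prints the CoreDNS image so you can tell whether the cluster is already on a release newer than 1.5.1 (the deployment name assumes a standard EKS install):

```sh
# Which CoreDNS image is the cluster running?
kubectl -n kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.containers[0].image}'; echo

# Plain lookups vs IPv4-only lookups; only the former should fail
# intermittently if this is the CoreDNS issue described above.
for i in $(seq 1 200); do
  curl -sS -o /dev/null raw.githubusercontent.com        || echo "plain  attempt $i failed"
  curl -sS -o /dev/null --ipv4 raw.githubusercontent.com || echo "--ipv4 attempt $i failed"
done
```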