dns: High level of i/o timeout errors in nodelocaldns pod when coredns pod is on the same node
Environment
- coredns version: 1.8.0
- nodelocaldns version: v1.17.0
- kubernetes version: 1.20.8
- kube-proxy in ipvs mode
Config
configmap:
data:
  Corefile: |
    .:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10
        forward . 192.168.128.10 {
            force_tcp
        }
        prometheus :9253
        health 169.254.20.10:8080
    }
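As a quick sanity check of this config (a sketch only, assuming dig is installed on the node and the coredns service IP 192.168.128.10 is reachable from it), you can query node-local-dns on its bind address and then the upstream directly:

# Query node-local-dns on its link-local bind address, from the node itself
dig +short @169.254.20.10 kubernetes.default.svc.cluster.local

# Query the upstream coredns service IP directly, over TCP as force_tcp would
dig +short +tcp @192.168.128.10 kubernetes.default.svc.cluster.local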
Issue Description
We are seeing a high level of i/o timeout errors in the node-local-dns daemonset pods on nodes where a coredns pod is also running.
Running the following to list each node-local-dns pod with its node, age, and error count:

for pod in $(kubectl -n kube-system get po -l k8s-app=node-local-dns -o name | cut -f2 -d '/'); do
  echo "$(kubectl -n kube-system get po $pod -o wide --no-headers | awk '{ print $1,$7,$5 }') $(kubectl -n kube-system logs $pod | grep ERRO | wc -l)"
done
When sorting the above output by error count, we see a massive increase in errors on the nodes where coredns is running.
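For reference, a sorted variant of the loop above (a sketch, assuming the output fields are pod name, node, age, and error count, so the count is field 4):

# Same loop, piped through sort so the noisiest pods come last
for pod in $(kubectl -n kube-system get po -l k8s-app=node-local-dns -o name | cut -f2 -d '/'); do
  echo "$(kubectl -n kube-system get po $pod -o wide --no-headers | awk '{ print $1,$7,$5 }') $(kubectl -n kube-system logs $pod | grep ERRO | wc -l)"
done | sort -k4 -n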
Reproduce
Running a k8s job with a dnsperf pod in it, with 3 FQDNs to look up: one in-cluster service, one service in AWS, and one external service.
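A minimal sketch of the dnsperf run used inside such a job, assuming a query file with one hypothetical FQDN of each kind and the rate and duration implied by the statistics below:

# queries.txt - one "name type" per line; the names here are hypothetical placeholders
#   my-app.default.svc.cluster.local A
#   my-db.abc123.eu-west-1.rds.amazonaws.com A
#   example.com A

# 5000 qps for 300 s; -s points at node-local-dns
# (or whichever DNS IP the pod normally uses)
dnsperf -s 169.254.20.10 -d queries.txt -Q 5000 -l 300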
On a node where coredns is also running
Statistics:
Queries sent: 1500000
Queries completed: 1500000 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 1499884 (99.99%), SERVFAIL 116 (0.01%)
Average packet size: request 52, response 224
Run time (s): 300.000132
Queries per second: 4999.997800
Average Latency (s): 0.000260 (min 0.000044, max 1.067598)
Latency StdDev (s): 0.012459
On a node where coredns is not also running
Statistics:
Queries sent: 1500000
Queries completed: 1500000 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 1500000 (100.00%)
Average packet size: request 52, response 224
Run time (s): 300.000153
Queries per second: 4999.997450
Average Latency (s): 0.000208 (min 0.000045, max 1.082059)
Latency StdDev (s): 0.010506
Cluster service resolution vs external resolution
When the test set is all k8s service FQDNs, SERVFAIL errors are much lower than when the test set is all external FQDNs:
- On a node where no coredns is running: 0 SERVFAIL
- On a node where coredns is running, all on-cluster FQDNs: 8 SERVFAIL from 1500000 queries
- On a node where coredns is running, all external FQDNs: 164 SERVFAIL from 1500000 queries
When we scale coredns down to one replica and run dnsperf on the same node, we get no SERVFAILs in the test.
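For reference, a sketch of the scale-down step used for that test, assuming coredns runs as the usual kube-system deployment:

# Temporarily reduce coredns to a single replica
kubectl -n kube-system scale deployment coredns --replicas=1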
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 31 (18 by maintainers)
In my case, I think coreDNS is working fine (it does not time out). However, nodelocaldns rarely makes any connection with the coreDNS service (192.168.0.3).

@rahul-paigavan In most modern K8S setups the DNS and TLS layers are abused by not properly using keep-alive (HTTP / TCP) on connections, which causes DNS to be queried and TLS to be handshaked on every request. Here is a great blog post about that in the context of NodeJS, but other stacks are similar: https://www.lob.com/blog/use-http-keep-alive
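A minimal illustration of the effect, assuming curl is available (example.com is only a placeholder host): when two requests share one curl invocation the connection is kept alive, so the second request skips the DNS lookup and TLS handshake.

# Two HTTPS requests in one invocation: curl reuses the connection,
# so DNS resolution and the TLS handshake happen only once
# (look for "Re-using existing connection" in the verbose output).
curl -sv -o /dev/null -o /dev/null https://example.com/ https://example.com/ 2>&1 | grep -Ei 're-using|connected to|ssl connection'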
Other than that, you should check and optimize the DNS zones inside the nodelocal config; look for:
Overall, the DNS and TLS layer is the most abused and most neglected layer by engineers, yet it is the most important thing in a distributed, clustered system, so put a decent amount of time and effort into it or you will regret it!
I think I have the same issue. There are quite a lot of timed-out requests from nodelocaldns to coredns (about 70% of DNS requests time out on the worst-case node). And I just realized that only nodes with a coredns pod have this issue, as @rtmie said.