cilium: frequent dnsproxy timeouts
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Periodically, for tens of minutes at a time, the dnsproxy will not respond to queries and they time out. The more queries a dnsproxy handles, the more likely timeouts become: on high-traffic nodes it happens nearly immediately, while lower-traffic nodes take a bit longer before they start exhibiting the same symptoms.
Cilium Version
Client: 1.13.3 36cb0eed 2023-05-17T12:31:14-04:00 go version go1.19.8 linux/amd64
Daemon: 1.13.3 36cb0eed 2023-05-17T12:31:14-04:00 go version go1.19.8 linux/amd64
Kernel Version
Linux ip-10-0-37-189 5.15.0-1036-aws #40~20.04.1-Ubuntu SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
v1.24.13-eks-0a21954
Sysdump
No response
Relevant log output
level=warning msg="Lock acquisition time took longer than expected. Potentially too many parallel DNS requests being processed, consider adjusting --dnsproxy-lock-count and/or --dnsproxy-lock-timeout" dnsName=bar.default.svc.cluster.local. duration=6.444839747s expected=1s subsys=daemon
level=warning msg="Lock acquisition time took longer than expected. Potentially too many parallel DNS requests being processed, consider adjusting --dnsproxy-lock-count and/or --dnsproxy-lock-timeout" dnsName=foo.bar.svc.cluster.local. duration=2.08938462s expected=1s subsys=daemon
level=error msg="Failed to dial connection to the upstream DNS server, cannot service DNS request" DNSRequestID=49079 dnsName=foo.bar.svc.cluster.local. endpointID=1176 error="failed to dial connection to 10.192.166.90:53: dial udp 10.192.166.90:53: i/o timeout" identity=75543 ipAddr="10.192.11.175:50687" subsys=fqdn/dnsproxy
Anything else?
Tried setting --dnsproxy-lock-count=1024 and --dnsproxy-lock-timeout=3s (sketched below), but that doesn’t prevent the issue.
The highest-traffic cluster’s upstream CoreDNS pods report a total of ~120 qps.
The highest-traffic dnsproxy reports 20 qps.
There’s a correlation between pods that talk to S3 frequently and nodes that exhibit timeouts.
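For reference, a minimal sketch of how the flags above could be set on a Helm-managed install (this assumes the chart’s `extraArgs` passthrough to the agent; adjust to however your agent flags are managed):

```yaml
# values.yaml fragment -- assumes a Helm-managed Cilium install where
# `extraArgs` entries are appended to the cilium-agent command line.
extraArgs:
  - --dnsproxy-lock-count=1024
  - --dnsproxy-lock-timeout=3s
```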
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 28 (24 by maintainers)
Yeah, there are two rules when using DNS policies: the DNS traffic itself has to be explicitly allowed with an L7 `rules: dns` section so that it is routed through the DNS proxy, and `toFQDNs` rules then only permit traffic to IPs that were actually resolved through that proxy. See https://docs.cilium.io/en/stable/security/policy/language/#dns-proxy for more details. It’s admittedly subtle. 🙂
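For anyone landing here later, a minimal sketch of that shape, following the structure in the docs linked above (the policy name, selector, and FQDN pattern are hypothetical):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: example-fqdn-egress        # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: example                 # hypothetical selector
  egress:
    # Rule 1: DNS traffic must be explicitly allowed, and the `rules: dns`
    # section routes it through the DNS proxy so lookups can be observed.
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Rule 2: toFQDNs then only allows traffic to IPs that were resolved
    # through the proxy for a matching name.
    - toFQDNs:
        - matchPattern: "*.s3.amazonaws.com"   # hypothetical pattern
```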
Oh, that’s how our policies are set up now. We allow `cluster`, but anything to `world` needs `toFQDNs` rules. I’m going to try to figure out how to trigger this in a test cluster. Right now it happens immediately in our largest prod clusters, and after minutes to hours (depending on cluster size) for the rest of our prod clusters.
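To make that concrete, a sketch of the egress shape described above, as a fragment of a CiliumNetworkPolicy `spec` (a hypothetical reconstruction of the reporter’s setup, not their actual policy):

```yaml
# Fragment of a CiliumNetworkPolicy `spec` -- hypothetical reconstruction.
egress:
  # Anything inside the cluster is allowed outright via the `cluster` entity.
  - toEntities:
      - cluster
  # Egress to `world` only via matching toFQDNs rules (combined with a DNS
  # rule like the earlier example so the proxy can observe lookups).
  - toFQDNs:
      - matchPattern: "*.s3.amazonaws.com"   # hypothetical pattern
```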
Once I can repro this easily, is it `gops` that I use to get the continuous profile you’re looking for?
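(For posterity, a sketch of how that could look, assuming the gops binary ships in the agent image; the pod name is a placeholder:)

```sh
# Placeholder pod name; cilium-agent is typically PID 1 in its container.
kubectl -n kube-system exec cilium-xxxxx -c cilium-agent -- gops stack 1

# 30-second CPU profile; gops saves the dump to a file inside the container.
kubectl -n kube-system exec cilium-xxxxx -c cilium-agent -- gops pprof-cpu 1
```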