cilium: frequent dnsproxy timeouts

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Periodically, for tens of minutes at a time, the dnsproxy stops responding to queries and requests time out. The more queries a dnsproxy handles, the more likely timeouts become: on high-traffic nodes it happens almost immediately, while lower-traffic nodes take a bit longer before they start exhibiting the same symptoms.

Cilium Version

Client: 1.13.3 36cb0eed 2023-05-17T12:31:14-04:00 go version go1.19.8 linux/amd64
Daemon: 1.13.3 36cb0eed 2023-05-17T12:31:14-04:00 go version go1.19.8 linux/amd64

Kernel Version

Linux ip-10-0-37-189 5.15.0-1036-aws #40~20.04.1-Ubuntu SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

v1.24.13-eks-0a21954

Sysdump

No response

Relevant log output

level=warning msg="Lock acquisition time took longer than expected. Potentially too many parallel DNS requests being processed, consider adjusting --dnsproxy-lock-count and/or `" dnsName=bar.default.svc.cluster.local. duration=6.444839747s expected=1s subsys=daemon
level=warning msg="Lock acquisition time took longer than expected. Potentially too many parallel DNS requests being processed, consider adjusting --dnsproxy-lock-count and/or --dnsproxy-lock-timeout" dnsName=foo.bar.svc.cluster.local. duration=2.08938462s expected=1s subsys=daemon
level=error msg="Failed to dial connection to the upstream DNS server, cannot service DNS request" DNSRequestID=49079 dnsName=foo.bar.svc.cluster.local. endpointID=1176 error="failed to dial connection to 10.192.166.90:53: dial udp 10.192.166.90:53: i/o timeout" identity=75543 ipAddr="10.192.11.175:50687" subsys=fqdn/dnsproxy

Anything else?

Tried setting --dnsproxy-lock-count=1024 and --dnsproxy-lock-timeout=3s, but that doesn’t prevent the issue.
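For reference, a sketch of how those flags can be wired in, assuming they are passed to the agent through the Helm chart’s extraArgs (our exact mechanism may differ; the values are the ones from the attempt above):

# values.yaml excerpt (hypothetical; extraArgs passthrough is an assumption)
extraArgs:
- "--dnsproxy-lock-count=1024"
- "--dnsproxy-lock-timeout=3s"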

The highest-traffic cluster’s upstream CoreDNS pods report a total of ~120 QPS

The highest-traffic dnsproxy reports ~20 QPS

There’s a correlation between pods that talk to S3 frequently and nodes that exhibit timeouts

Code of Conduct

  • I agree to follow this project’s Code of Conduct


Most upvoted comments

Yeah, there are two rules when using DNS policies:

  1. DNS L7 rule
  2. FQDN allow rules

See https://docs.cilium.io/en/stable/security/policy/language/#dns-proxy for more details. It’s admittedly subtle. 🙂

Could you try to keep those queries as part of the first rule but not the second? Meaning keep matchPattern: '*' for (1) and the “everything except …” approach only for (2). I’d be interested to know whether that reduces the load enough.
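A rough sketch of what I mean, with a generic endpoint selector and an S3 wildcard as placeholders (not taken from your actual policies):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: example-dns-egress        # hypothetical policy name
spec:
  endpointSelector:
    matchLabels:
      app: example                # placeholder selector
  egress:
  # (1) DNS L7 rule: route DNS through the proxy, allow any query name
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # (2) FQDN allow rule: only resolved names matching this pattern are reachable
  - toFQDNs:
    - matchPattern: "*.s3.amazonaws.com"   # placeholder

So the proxy still sees and answers every query under (1), while (2) only controls which resolved destinations the pod may then connect to.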

Oh, that’s how our policies are set up now. We allow cluster-internal traffic, but anything going to world needs toFQDNs rules.

If that doesn’t do it, then I’m afraid that a continuous profile of Cilium whenever the timeouts are occurring would be the only way to debug it further.

I’m going to try to figure out how to trigger this in a test cluster. Right now it happens immediately in our largest prod clusters, and after minutes to hours (depending on cluster size) in the rest of our prod clusters.

Once I can repro this easily, is it gops that I use to get the continuous profile you’re looking for?