cilium-cli: flake: `allow-all/pod-to-world/http(s)-to-cilium-io` fails
Observed in https://github.com/cilium/cilium-cli/pull/366 on EKS:
https://github.com/cilium/cilium-cli/pull/366/checks?check_run_id=2905226287
The flow logs indicate that a DNS request was made, but no TCP connection was ever established:
[.] Action [allow-all/pod-to-world/http-to-www-google: cilium-test/client-7b7bf54b85-h6qvt (10.0.1.205) -> www-google-http (www.google.com:80)]
❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://www.google.com:80" failed: command terminated with exit code 28
📄 Matching flows for pod cilium-test/client-7b7bf54b85-h6qvt
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
⌛ Waiting (5s) for flows: Required flows not found yet
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
ℹ️ SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
✅ DNS request found at 0
✅ DNS response found at 3
❌ Flow validation failed for pod cilium-test/client-7b7bf54b85-h6qvt: 1 failures (first: 0, last: 3, matched: 2)
It’s not clear why curl never made any outgoing TCP connection. Unfortunately, we don’t seem to collect any L7 flows, which would give us more insight into what the DNS response was.
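For reference, a Hubble query along these lines could surface those missing DNS details once L7 visibility is enabled; this is only a sketch, with the pod name taken from the run above and the flags assumed from the Hubble CLI of that era:

```sh
# Sketch: ask Hubble for the L7 DNS flows of the failing client pod. This only
# returns data once DNS visibility is enabled (L7 policy or visibility
# annotation); the pod name is from the run above.
hubble observe --type l7 --protocol dns \
  --from-pod cilium-test/client-7b7bf54b85-h6qvt --last 20
```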
Commits related to this issue
- test: Add DNS visibility annotations to client deployments Bring back the DNS visibility annotation (466e8f1) that got reverted in PR #356. I reverted it because it depended on 07a161d which was caus... — committed to cilium/cilium-cli by jrajahalme 3 years ago
- check: Replace [www.]google.com with [jenkins.]cilium.io Use [jenkins.]cilium.io for FQDN tests to reduce external dependencies. Ref: #367 Signed-off-by: Michi Mutsuzaki <michi@isovalent.com> — committed to cilium/cilium-cli by michi-covalent 3 years ago
- Switch pod to world testing from `cilium.io` to `one.one.one.one` Tentative fix for #367. We have two main hypothesis around why connections to external domains are flaky: - The external domain its... — committed to cilium/cilium-cli by nbusseneau 3 years ago
- experiment,ci: bump dynamic map sizes to .5 in response to ticket: https://github.com/cilium/cilium-cli/issues/367 this pull request bumps all dynamically sized maps to 1/2 the available memory. map... — committed to cilium/cilium-cli by ldelossa 3 years ago
- experiment,ci: bump dynamic map sizes to .5 in response to ticket: https://github.com/cilium/cilium-cli/issues/367 this pull request bumps all dynamically sized maps to .9 the available memory. map ... — committed to cilium/cilium-cli by ldelossa 3 years ago
- test: Run tests with UDP proxy redirections last For the past three months [1], we've had a flake in the connectivity checks where DNS resolutions sometimes fail. Two weeks ago, Louis noticed that th... — committed to cilium/cilium-cli by pchaigno 3 years ago
I’ve evaluated the different mitigations by running the connectivity tests a hundred times:
- `conntrack-gc-interval=1m` (defaults to 5m)
- `conntrack-gc-interval=30s` and `bpf-ct-timeout-regular-any=30s` (defaults to 1m)
- Reordering tests: a patched CLI guarantees the tests run in the order no-policies, allow-all, client-ingress, echo-ingress, client-egress, to-entities-world, to-cidr-1111, echo-ingress-l7, client-egress-l7, dns-only, to-fqdns. The last three tests are the only ones with UDP redirects to the proxy.

For that evaluation, I also flushed the CT table between each full run of the connectivity tests instead of rebuilding the cluster.
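A minimal sketch of applying the first two settings, assuming they are set through the `cilium-config` ConfigMap and that the agents are restarted to pick them up:

```sh
# Sketch: apply the two mitigations via the cilium-config ConfigMap, then
# restart the agents so they read the new values.
kubectl -n kube-system patch configmap cilium-config --type merge \
  -p '{"data":{"conntrack-gc-interval":"30s","bpf-ct-timeout-regular-any":"30s"}}'
kubectl -n kube-system rollout restart daemonset/cilium
```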
I’ll send a PR to reorder the tests as a temporary workaround.
Following discussion in the community meeting: switch the tests to `one.one.one.one` first, then:

The issue uncovered by these flaky tests is now tracked at https://github.com/cilium/cilium/issues/17459.
@pchaigno Do we actually see this issue in the wild?
I’d like to consider whether this is just an issue during our tests or if we think users of Cilium will suffer from this.
Do we believe users will be churning network policies as often as our automated test suite does?
I ask because there are several trivial fixes we can implement today in our testing suite to solve this.
Root Cause Analysis
From https://github.com/cilium/cilium-cli/issues/367#issuecomment-920218323, one hypothesis we had is the following sequence of events:
There is a timing constraint here: the new UDP connection must happen before GC has a chance to clean up the old entry. GC runs every five minutes by default.
A collision on the source port also seems fairly unlikely, unless you take into account that Linux has a limited ephemeral port allocation range (~28k ports) and look at this as a sort of birthday paradox.
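As a rough illustration of that birthday-style effect (the entry count of 200 is an arbitrary number for illustration, not something measured in CI):

```sh
# Back-of-the-envelope birthday estimate: probability that at least two of
# k source ports collide when drawn uniformly from ~28k ephemeral ports.
python3 -c 'import math; n, k = 28000, 200; print(1 - math.prod((n - i) / n for i in range(k)))'
# prints roughly 0.51, i.e. ~50% chance of at least one collision among 200 ports
```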
We should be able to confirm this using Hubble logs, by checking whether a previous UDP connection used the same source port less than five minutes before. Unfortunately, Hubble didn’t offer a way to differentiate between proxy-redirect policy verdicts and other policy verdicts. We also didn’t collect enough Hubble flows in our test. I fixed the first issue and opened a PR to increase the Hubble buffer size: https://github.com/cilium/cilium/pull/17429.
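A query along these lines is the kind of check meant here; it is only a sketch, with the pod name taken from the failing run above and the flags assumed from the Hubble CLI:

```sh
# Sketch: list recent UDP policy-verdict events from the client towards port 53
# and eyeball whether the same source port shows up twice within the default
# 5-minute GC window.
hubble observe --type policy-verdict --protocol udp --port 53 \
  --from-pod cilium-test/client-7b7bf54b85-h6qvt --since 5m
```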
After a little while, the tests failed in that PR: https://github.com/cilium/cilium/pull/17429/checks?check_run_id=3640670462. At 14:43:05.952, we can see the usual `to-proxy` from client2 on port 46962 to CoreDNS. Checking the Hubble flows for that same node, we find that a first connection had been established on that same 5-tuple, less than 5 minutes before, and redirected to the proxy. This seems to confirm that the root cause is an incorrect redirect to the proxy caused by the stale UDP CT entry of a previous connection.
Workaround
A first workaround would be to flush the CT table after applying new network security policies. A second option would be to decrease `conntrack-gc-interval` so that GC of stale CT entries happens more often. The second option is much less disruptive.

Neither option is viable for cilium-cli, however, since we need a solution that works for users without requiring them to edit their ConfigMap and, obviously, without re-establishing all existing connections 😸
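For completeness, a sketch of the first, more disruptive workaround, assuming the usual `k8s-app=cilium` pod label and `cilium-agent` container name:

```sh
# Sketch: flush the global CT table on every node by exec'ing into each agent.
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
  kubectl -n kube-system exec "$pod" -c cilium-agent -- cilium bpf ct flush global
done
```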
Solution
Maybe a better solution here would be to have a grace period in the userspace proxy before we actually start dropping packets for existing connections that shouldn’t be redirected to the proxy anymore. I noticed that the TPROXY redirects were still in place in the datapath, so I’m assuming it’s the proxy that drops packets here. /cc @jrajahalme
Ah right, only for ClusterIP services IIRC which is the case here 👍
I was able to confirm this behavior even more easily by setting the `ct` map size to 2048 in the Cilium ConfigMap. Once we hit enough NAT entries, we see the demonstrated behavior.
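A sketch of that reproduction aid, assuming the `bpf-ct-global-*-max` ConfigMap keys (an agent restart is needed for the new sizes to take effect, and this is for throwaway test clusters only):

```sh
# Sketch: shrink the CT maps to 2048 entries so eviction behaviour shows up quickly.
kubectl -n kube-system patch configmap cilium-config --type merge \
  -p '{"data":{"bpf-ct-global-any-max":"2048","bpf-ct-global-tcp-max":"2048"}}'
kubectl -n kube-system rollout restart daemonset/cilium
```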
The outstanding question is whether we are erroneously evicting records from the `ct` table or from the `nat` table. Once this is determined, the immediate fix is to allocate more space to the NAT-related BPF maps in CI.
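A quick way to compare the two tables on a node might look like this (sketch; the counts are rough and include header lines):

```sh
# Sketch: rough entry counts of the CT and NAT tables on one node, to see
# which table is filling up and evicting.
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium bpf ct list global | wc -l
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium bpf nat list | wc -l
```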
A long-term fix is a bit more involved and revolves around the connection-tracking GC algorithm. This will most likely be a new proposal.
I can now readily reproduce this.
The CT map has a dynamic garbage collection interval. If you wait long enough you’ll see the CT map garbage collection interval get sufficiently large.
Once this happens, you can begin spamming kube-dns with the same curl commands which fail in our connectivity test suite.
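A sketch of that spamming step, reusing the curl invocation from the test suite inside the client pod (pod name taken from the run above; run it in several terminals in parallel):

```sh
# Sketch: reuse the test suite's curl from inside the client pod in a tight
# loop; every iteration makes a DNS lookup from a fresh ephemeral UDP port.
kubectl -n cilium-test exec client-7b7bf54b85-h6qvt -- \
  sh -c 'while true; do
           curl --silent --fail --show-error --connect-timeout 5 \
                --output /dev/null http://www.google.com:80 || echo "curl failed"
         done'
```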
Doing this from a few shells will begin to pack the CT map with outgoing CT entries for the kube-dns service IP.
Once the map gets sufficiently large (about 75,000 entries in my testing), it begins to delete recent entries in small batches. I do not believe this is GC doing this; maybe some other overflow mechanism? When it begins to delete these entries, curl starts to fail.
Finally, a proper CT map GC occurs with a massive deleteRatio.
From here, the new GC interval kicks in, the CT map is cleared out more efficiently, and curl stops failing.
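To watch this from the agent side, something along these lines can be used (sketch only; the exact GC log wording and metric names depend on the Cilium version and log level):

```sh
# Sketch: watch conntrack GC activity from the agent. The GC log lines mention
# the interval and deleteRatio (log level permitting); metrics are version-dependent.
kubectl -n kube-system logs ds/cilium -c cilium-agent --tail=500 | grep -i conntrack
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium metrics list | grep conntrack_gc
```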
I went down the rabbit hole on this.
I’m attaching two pcaps which demonstrate the issue when curl hangs.
The relevant summary from these pcaps follows:
Successful DNS requests utilize the `/etc/resolv.conf` resolver: in the pcaps you’ll see successful DNS request flows with 10.155.240.10 as source and destination.

When curl begins to hang, you’ll see that the return traffic for a DNS request to 10.155.240.10 comes from source 10.152.0.119. When the return packet with this source hits the network stack, an ICMP port unreachable message is returned to the source. This 10.152.0.119 IP address is the endpoint IP for kube-dns in my reproduction environment.

As a bit of summary, what I think is going on here is:
I believe subsequent debugging should investigate why we are failing to un-DNAT in certain scenarios.
Also, here are the flows in the pcap indicating that a request from an ephemeral source port destined to the `/etc/resolv.conf` resolver has its return traffic, which is destined to the same ephemeral port, sourced from the non-NAT’d kube-dns address:

dns-flake-pcaps.zip
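A capture filter along these lines, run on the client’s node, makes the mis-sourced replies easy to spot (sketch; the IPs are the ones from this reproduction environment):

```sh
# Sketch: show DNS replies and who they claim to be from. Healthy replies are
# sourced from the ClusterIP (10.155.240.10 here); replies sourced from the
# endpoint IP (10.152.0.119) were not reverse-DNATed and trigger the ICMP
# port unreachable described above.
tcpdump -ni any 'udp and src port 53 and (src host 10.155.240.10 or src host 10.152.0.119)'
```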
thanks @nbusseneau
I opened a draft PR to bring back the DNS visibility annotation (https://github.com/cilium/cilium-cli/pull/510), hoping to hit the flake and get some more information about what’s failing 👀
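Roughly what that looks like when applied by hand (sketch only: the annotation key/value are assumed from the Cilium L7-visibility docs of that era, and the `kind=client` label selector is an assumption about the test manifests; the PR itself annotates the client Deployments):

```sh
# Sketch: mark the test client pods for DNS proxy visibility so their DNS
# traffic shows up as L7 flows in Hubble.
kubectl -n cilium-test annotate pods -l kind=client \
  policy.cilium.io/proxy-visibility="<Egress/53/UDP/DNS>" --overwrite
```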
#373 changed the connectivity tests to use `*.cilium.io` instead of `*.google.com`, but it seems like failures of this kind still occur; see e.g. https://github.com/cilium/cilium-cli/runs/2969863237.