cilium-cli: flake: `allow-all/pod-to-world/http(s)-to-cilium-io` fails

Observed in https://github.com/cilium/cilium-cli/pull/366 on EKS:

https://github.com/cilium/cilium-cli/pull/366/checks?check_run_id=2905226287

The flow logs indicate that a DNS request was made, but no TCP connection was ever established:

   [.] Action [allow-all/pod-to-world/http-to-www-google: cilium-test/client-7b7bf54b85-h6qvt (10.0.1.205) -> www-google-http (www.google.com:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://www.google.com:80" failed: command terminated with exit code 28
  📄 Matching flows for pod cilium-test/client-7b7bf54b85-h6qvt
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ⌛ Waiting (5s) for flows: Required flows not found yet
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ℹ️  SYN and(ip(src=10.0.1.205),tcp(dstPort=80),tcpflags(syn)) not found
  ✅ DNS request found at 0
  ✅ DNS response found at 3
  ❌ Flow validation failed for pod cilium-test/client-7b7bf54b85-h6qvt: 1 failures (first: 0, last: 3, matched: 2)

It’s not clear why curl never made any outgoing TCP connection. Unfortunately, we don’t seem to collect any L7 flows, which would give us more insight into what the DNS response was.
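
If L7 DNS flows had been collected (they require DNS visibility to be enabled), they could be filtered out of a flow dump along these lines; the file name is illustrative and the pod name is taken from the log above:

  cat hubble-flows.json | ./hubble observe -t l7 --protocol dns \
    --pod cilium-test/client-7b7bf54b85-h6qvt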

Most upvoted comments

I’ve evaluated the different mitigations by running the connectivity tests a hundred times:

  • No mitigation: 49/100 failures
  • conntrack-gc-interval=1m (defaults to 5m): 16/100 failures
  • conntrack-gc-interval=30s, bpf-ct-timeout-regular-any=30s (defaults to 1m): 1/100 failures
  • Reordering tests: 0/100 failures
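
For reference, a minimal sketch of how the conntrack-gc-interval / bpf-ct-timeout-regular-any settings above could be applied, assuming a standard installation with the cilium-config ConfigMap in kube-system (the agents must be restarted to pick up the new values):

  # Sketch only: lower the CT GC interval and the UDP/any CT timeout to 30s.
  kubectl -n kube-system patch configmap cilium-config --type merge \
    -p '{"data":{"conntrack-gc-interval":"30s","bpf-ct-timeout-regular-any":"30s"}}'
  kubectl -n kube-system rollout restart daemonset/cilium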

Reordering tests corresponds to a patched CLI that guarantees the tests run in the order: no-policies, allow-all, client-ingress, echo-ingress, client-egress, to-entities-world, to-cidr-1111, echo-ingress-l7, client-egress-l7, dns-only, to-fqdns. The last three tests are the only ones with UDP redirects to the proxy. For that evaluation, I also flushed the CT table between each full run of the connectivity tests instead of rebuilding the cluster.

I’ll send a PR to reorder the tests as a temporary workaround.

Following discussion in the community meeting:

The issue uncovered by these flaky tests is now tracked at https://github.com/cilium/cilium/issues/17459.

@pchaigno Do we actually see this issue in the wild?

I’d like to consider whether this is just an issue during our tests or if we think users of Cilium will suffer from this.

Do we believe users will be churning network policies as often as our automated test suite does?

I ask because there are several trivial fixes we can implement today in our testing suite to solve this.

Root Cause Analysis

From https://github.com/cilium/cilium-cli/issues/367#issuecomment-920218323, one hypothesis we had is the following sequence of events:

  1. L7 policies are applied, causing redirections to the userspace proxy.
  2. A UDP connection is established with source port X and the CT map tracks it.
  3. L7 policies are removed, so redirections to the proxy shouldn’t happen anymore.
  4. A new UDP connection is established but it reuses source port X and therefore matches the existing CT entry. As a result, it is redirected to the userspace proxy, which drops the packet(s).

There is a timing constraint here: the new UDP connection must happen before GC has a chance to clean up the old entry. GC runs every five minutes by default.

A collision on the source port also seems fairly unlikely, unless you take into account that Linux allocates ephemeral source ports from a limited range (~28k ports by default) and look at this as a sort of birthday paradox.
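
As a quick sanity check on that range, the ephemeral port range used for source port allocation can be read on the client node:

  # Defaults to 32768-60999 on most Linux distributions, i.e. roughly 28k ports.
  sysctl net.ipv4.ip_local_port_range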

We should be able to confirm this using Hubble logs, by checking whether a previous UDP connection used the same port less than five minutes before. Unfortunately, Hubble didn’t offer a way to differentiate proxy-redirect policy verdicts from other policy verdicts, and we also didn’t collect enough Hubble flows in our test. I fixed the first issue and opened a PR to increase the Hubble buffer size: https://github.com/cilium/cilium/pull/17429.

After a little while, the tests failed in that PR: https://github.com/cilium/cilium/pull/17429/checks?check_run_id=3640670462. At 14:43:05.952, we can see the usual to-proxy flow from client2 on port 46962 to CoreDNS. Checking the Hubble flows for that same node, we find the following:

$ cat hubble-confirm-cli-flake.json | ./hubble observe -t policy-verdict | grep 46962
Sep 18 14:39:46.618: cilium-test/client2-65f446d77c-7k5vl:46962 -> kube-system/coredns-d4866bcb7-42s9j:53 L3-L4 REDIRECTED (UDP)

So a first connection was established on that same 5-tuple less than 5 minutes before and was redirected to the proxy. This seems to confirm that the root cause is an incorrect redirect to the proxy, caused by the stale UDP CT entry of a previous connection.

Workaround

A first workaround would be to flush the CT table after applying new network security policies. A second option would be to decrease conntrack-gc-interval so that GC of stale CT entries happens more often. The second option is much less disruptive.
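
For completeness, a rough sketch of the first option, assuming access to the Cilium agent pods (<cilium-agent-pod> is a placeholder; this has to be run on every node):

  # Flushes all conntrack state tracked by that agent, hence the disruption.
  kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf ct flush global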

Neither of those options is viable for cilium-cli, however, since we need a solution that works for users without requiring them to edit their ConfigMap and, obviously, without re-establishing all existing connections 😸

Solution

Maybe a better solution here would be to have a grace period in the userspace proxy before we actually start to drop packets for existing connections that shouldn’t be redirected to the proxy anymore. I noticed that the TPROXY redirects were still in place in the datapath, so I’m assuming it’s the proxy that drops the packets here. /cc @jrajahalme

I was under the impression that, even with KPR disabled, bpf_lxc still performs service <> pod translation and places CT and NAT entries into the appropriate BPF maps.

Ah right, only for ClusterIP services IIRC, which is the case here 👍
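
If needed, the resulting translation state can be inspected from the datapath side; a sketch, assuming access to the Cilium agent pod on the relevant node (<cilium-agent-pod> is a placeholder):

  # ClusterIP -> backend translations known to the BPF datapath
  kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf lb list
  # CT entries created for the translated connections
  kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf ct list global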

I was able to confirm this behavior even more easily by setting the CT map size to 2048 in the Cilium ConfigMap.

Once we hit enough NAT entries, we see the demonstrated behavior.

The outstanding question is whether we are erroneously evicting records from the CT table or from the NAT table.

Once this is determined, the immediate fix is to allocate more space to the NAT-related BPF maps in CI.
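
A rough sketch of what that bump could look like via the cilium-config ConfigMap; bpf-ct-global-any-max and bpf-nat-global-max are the documented options for static CT/NAT map sizing, and the values here are purely illustrative:

  kubectl -n kube-system patch configmap cilium-config --type merge \
    -p '{"data":{"bpf-ct-global-any-max":"262144","bpf-nat-global-max":"524288"}}'
  kubectl -n kube-system rollout restart daemonset/cilium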

A long-term fix is a bit more involved and revolves around the connection tracking GC algorithm. This will most likely be a new proposal.

I can now readily reproduce this.

The CT map has a dynamic garbage collection interval. If you wait long enough, you’ll see the garbage collection interval grow fairly large:

2021-09-02T17:50:54.021808954Z level=info msg="Conntrack garbage collector interval recalculated" deleteRatio=0.007598876953125 newInterval=17m45s subsys=map-ct

Once this happens, you can begin spamming kube-dns with the same curl commands that fail in our connectivity test suite:

 while curl --silent --fail --show-error --connect-timeout 5 --output /dev/null http://jenkins.cilium.io:80; do echo "ok $(date)"; done

Doing this from a few shells will begin to pack the CT map with outgoing CT entries for the kube-dns service IP.

Once the map gets sufficiently large (about 75,000 entries in my testing), it begins to delete recent entries in small batches. I don’t believe this is GC doing this; maybe some other overflow mechanism? When it begins to delete these overflow entries, curl starts to fail.
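
To watch the map fill up while the curl loops are running, the entry count can be polled from the agent on the client’s node (a rough sketch; <cilium-agent-pod> is a placeholder):

  while true; do
    kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf ct list global | wc -l
    sleep 5
  done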

Finally, a proper CT map GC occurs with a massive deleteRatio:

2021-09-02T18:08:39.622202292Z level=info msg="Conntrack garbage collector interval recalculated" deleteRatio=0.9 newInterval=1m46s subsys=map-ct

From here, the new GC interval kicks in, the CT map is cleared out more efficiently, and curl stops failing.

I went down the rabbit hole on this.

I’m attaching two pcaps which demonstrate the issue when curl hangs.

The relevant summary from these pcaps follows:

Successful DNS requests use the resolver from /etc/resolv.conf:

[root@client-57ddfcd474-zqmbl /]# cat /etc/resolv.conf 
search cilium-test.svc.cluster.local svc.cluster.local cluster.local c.cilium-dev.internal google.internal
nameserver 10.155.240.10

In the pcaps you’ll see successful DNS request flows with 10.155.240.10 as source and destination.

When curl begins to hang, you’ll see that the return traffic for a DNS request to 10.155.240.10 is instead sourced from 10.152.0.119. When the return packet with this source hits the network stack, an ICMP port unreachable message is returned to the source.

This 10.152.0.119 IP address is the endpoint IP for kube-dns in my reproduction environment:

ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                              IPv6   IPv4           STATUS   
           ENFORCEMENT        ENFORCEMENT                                                                                                 
495        Disabled           Disabled          19920      k8s:io.cilium.k8s.policy.cluster=ldelossa-dev                   10.152.0.119   ready   
                                                           k8s:io.cilium.k8s.policy.serviceaccount=kube-dns                                       
                                                           k8s:io.kubernetes.pod.namespace=kube-system                                            
                                                           k8s:k8s-app=kube-dns   

To summarize, what I think is going on here is:

  • curl is issued inside the container
  • a DNS request is sent to the resolver in /etc/resolv.conf (10.155.240.10)
  • the request goes through the typical kube-proxy iptables rules and gets DNAT’d to the kube-dns endpoint IP (10.152.0.119)
  • intermittently, the reverse DNAT of the reply (10.152.0.119 back to 10.155.240.10) never occurs
  • the untranslated response hits the network stack and there is no socket associated with this erroneous tuple
  • curl sits there and hangs until the timeout, since no return traffic ever gets back to it

I believe subsequent debugging should investigate why we fail to reverse the DNAT in certain scenarios.
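
One way to dig into this would be to compare the pcap with the datapath’s view of the connection, e.g. from the Cilium agent on the client’s node (a sketch; the source port is taken from the tcpdump output below and <cilium-agent-pod> is a placeholder):

  # CT entries for the affected DNS connection
  kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf ct list global | grep 39589
  # Reverse-NAT table mapping backend addresses back to the service address
  kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf lb list --revnat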

Also, here are the flows in the pcap showing that a request from an ephemeral source port, destined to the /etc/resolv.conf resolver, has its return traffic (destined to the same ephemeral port) sourced from the non-NAT’d kube-dns address.

  tcpdump -r tcpdump-dns-failure2.pcap port 39589 or proto ICMP
reading from file tcpdump-dns-failure2.pcap, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 65535
Warning: interface names might be incorrect
02:17:23.988204 ?     Out IP 10.152.0.181.39589 > 10.155.240.10.domain: 25286+ A? jenkins.cilium.io.cluster.local. (49)
02:17:23.988632 ?     In  IP 10.152.0.119.domain > 10.152.0.181.39589: 25286 NXDomain 0/1/0 (142)
02:17:23.988643 ?     Out IP 10.152.0.181 > 10.152.0.119: ICMP 10.152.0.181 udp port 39589 unreachable, length 178
02:17:23.988757 ?     Out IP 10.152.0.181.39589 > 10.155.240.10.domain: 48320+ AAAA? jenkins.cilium.io.cluster.local. (49)
02:17:23.989152 ?     In  IP 10.152.0.119.domain > 10.152.0.181.39589: 48320 NXDomain 0/1/0 (142)
02:17:23.989161 ?     Out IP 10.152.0.181 > 10.152.0.119: ICMP 10.152.0.181 udp port 39589 unreachable, length 178
02:17:28.993121 ?     Out IP 10.152.0.181.39589 > 10.155.240.10.domain: 25286+ A? jenkins.cilium.io.cluster.local. (49)
02:17:28.993575 ?     Out IP 10.152.0.181.39589 > 10.155.240.10.domain: 48320+ AAAA? jenkins.cilium.io.cluster.local. (49)
02:17:28.993728 ?     In  IP 10.152.0.119.domain > 10.152.0.181.39589: 25286 NXDomain 0/1/0 (142)
02:17:28.993737 ?     Out IP 10.152.0.181 > 10.152.0.119: ICMP 10.152.0.181 udp port 39589 unreachable, length 178
02:17:28.994197 ?     In  IP 10.152.0.119.domain > 10.152.0.181.39589: 48320 NXDomain 0/1/0 (142)
02:17:28.994205 ?     Out IP 10.152.0.181 > 10.152.0.119: ICMP 10.152.0.181 udp port 39589 unreachable, length 178

dns-flake-pcaps.zip

Thanks @nbusseneau!

I opened a draft PR to bring back the DNS visibility annotation (https://github.com/cilium/cilium-cli/pull/510), hoping to hit the flake and get some more information about what’s failing 👀

#373 changed the connectivity tests to use *.cilium.io instead of *.google.com, but failures of this kind still seem to occur; see e.g. https://github.com/cilium/cilium-cli/runs/2969863237.