cilium-cli: cilium connectivity test failures

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

When running cilium connectivity test, several of the tests fail: no-policies, client-egress-l7, and to-fqdns. They all fail because curl exits with code 28 (a timeout).

The odd thing is that when I start a curlimages/curl container and run the commands listed in the test output (minus the -w and --output flags), the page downloads just fine.
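For reference, curl exit code 28 is CURLE_OPERATION_TIMEDOUT, and the output below shows the timeout happening during DNS resolution. A manual reproduction along these lines (the pod name is illustrative, and it assumes kubectl access to the cluster) would be:

```shell
# Throwaway curl pod for manual testing; the pod name "curl-debug" is
# illustrative. This mirrors the failing test command minus the -w and
# --output flags; --rm deletes the pod once the command exits.
kubectl run curl-debug --rm -it --restart=Never --image=curlimages/curl -- \
  curl --silent --fail --show-error --connect-timeout 5 http://one.one.one.one:80
```

If this succeeds while the connectivity test fails, the difference is likely in the namespace, labels, or policies applied to the test pods rather than in basic egress connectivity.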

This is a brand-new Kubernetes 1.23 cluster; Cilium was the first thing I installed. The host OS is Ubuntu 20.04, running firewalld as a host-level firewall.

Is this something I should try to fix, or can I ignore these tests?

Cilium Version

cilium-cli: v0.10.0 compiled with go1.17.4 on linux/amd64
cilium image (default): v1.11.0
cilium image (stable): v1.11.0
cilium image (running): v1.11.0

Kernel Version

Linux controlplane 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:34:54Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

No response

Relevant log output

root@controlplane:~# cilium connectivity test
ℹ️  Monitor aggregation detected, will skip some flow validation steps
⌛ [clustername] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [clustername] Waiting for deployments [echo-other-node] to become ready...
⌛ [clustername] Waiting for CiliumEndpoint for pod cilium-test/client-7568bc7f86-2mdgt to appear...
⌛ [clustername] Waiting for CiliumEndpoint for pod cilium-test/client2-686d5f784b-5llc9 to appear...
⌛ [clustername] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-59d779959c-2jggr to appear...
⌛ [clustername] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-5767b7b99d-rmcqc to appear...
⌛ [clustername] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [clustername] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.8:31231 (cilium-test/echo-other-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.8:32187 (cilium-test/echo-same-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.9:31231 (cilium-test/echo-other-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.9:32187 (cilium-test/echo-same-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.207:31231 (cilium-test/echo-other-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.207:32187 (cilium-test/echo-same-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.7:31231 (cilium-test/echo-other-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.7:32187 (cilium-test/echo-same-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.208:32187 (cilium-test/echo-same-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.208:31231 (cilium-test/echo-other-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.209:31231 (cilium-test/echo-other-node) to become ready...
⌛ [clustername] Waiting for NodePort nnn.nnn.nnn.209:32187 (cilium-test/echo-same-node) to become ready...
ℹ️  Skipping IPCache check
⌛ [clustername] Waiting for pod cilium-test/client-7568bc7f86-2mdgt to reach default/kubernetes service...
⌛ [clustername] Waiting for pod cilium-test/client2-686d5f784b-5llc9 to reach default/kubernetes service...
🔭 Enabling Hubble telescope...
⚠️  Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp [::1]:4245: connect: connection refused"
ℹ️  Expose Relay locally with:
   cilium hubble enable
   cilium hubble port-forward&
🏃 Running tests...

[=] Test [no-policies]
................................................
[=] Test [allow-all]
............................................
[=] Test [client-ingress]
..
[=] Test [echo-ingress]
....
[=] Test [client-egress]
....
[=] Test [to-entities-world]
......
[=] Test [to-cidr-1111]
....
[=] Test [echo-ingress-l7]
....
[=] Test [client-egress-l7]
........
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-only-dns' to namespace 'cilium-test'..
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-l7-http' to namespace 'cilium-test'..
  [-] Scenario [client-egress-l7/pod-to-pod]
  [.] Action [client-egress-l7/pod-to-pod/curl-0: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> cilium-test/echo-same-node-5767b7b99d-rmcqc (nnn.nnn.3.159:8080)]
  [.] Action [client-egress-l7/pod-to-pod/curl-1: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> cilium-test/echo-other-node-59d779959c-2jggr (nnn.nnn.4.193:8080)]
  [.] Action [client-egress-l7/pod-to-pod/curl-2: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> cilium-test/echo-other-node-59d779959c-2jggr (nnn.nnn.4.193:8080)]
  [.] Action [client-egress-l7/pod-to-pod/curl-3: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> cilium-test/echo-same-node-5767b7b99d-rmcqc (nnn.nnn.3.159:8080)]
  [-] Scenario [client-egress-l7/pod-to-world]
  [.] Action [client-egress-l7/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-http (one.one.one.one:80)]
  [.] Action [client-egress-l7/pod-to-world/https-to-one-one-one-one-0: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-https (one.one.one.one:443)]
  [.] Action [client-egress-l7/pod-to-world/https-to-one-one-one-one-index-0: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-https-index (one.one.one.one:443)]
  [.] Action [client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-http (one.one.one.one:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Resolving timed out after 5000 milliseconds
:0 -> :0 = 000
  
  📄 No flows recorded during action http-to-one-one-one-one-1
  📄 No flows recorded during action http-to-one-one-one-one-1
  [.] Action [client-egress-l7/pod-to-world/https-to-one-one-one-one-1: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-https (one.one.one.one:443)]
  [.] Action [client-egress-l7/pod-to-world/https-to-one-one-one-one-index-1: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-https-index (one.one.one.one:443)]
  ℹ️  📜 Deleting CiliumNetworkPolicy 'client-egress-only-dns' from namespace 'cilium-test'..
  ℹ️  📜 Deleting CiliumNetworkPolicy 'client-egress-l7-http' from namespace 'cilium-test'..

[=] Test [dns-only]
..........
[=] Test [to-fqdns]
.
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-to-fqdns-one-one-one-one' to namespace 'cilium-test'..
  [-] Scenario [to-fqdns/pod-to-world]
  [.] Action [to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-http (one.one.one.one:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Resolving timed out after 5000 milliseconds
:0 -> :0 = 000
  
  📄 No flows recorded during action http-to-one-one-one-one-0
  📄 No flows recorded during action http-to-one-one-one-one-0
  [.] Action [to-fqdns/pod-to-world/https-to-one-one-one-one-0: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-https (one.one.one.one:443)]
  [.] Action [to-fqdns/pod-to-world/https-to-one-one-one-one-index-0: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-https-index (one.one.one.one:443)]
  [.] Action [to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-http (one.one.one.one:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Resolving timed out after 5000 milliseconds
:0 -> :0 = 000
  
  📄 No flows recorded during action http-to-one-one-one-one-1
  📄 No flows recorded during action http-to-one-one-one-one-1
  [.] Action [to-fqdns/pod-to-world/https-to-one-one-one-one-1: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-https (one.one.one.one:443)]
  [.] Action [to-fqdns/pod-to-world/https-to-one-one-one-one-index-1: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-https-index (one.one.one.one:443)]
  ℹ️  📜 Deleting CiliumNetworkPolicy 'client-egress-to-fqdns-one-one-one-one' from namespace 'cilium-test'..

📋 Test Report
❌ 2/11 tests failed (3/142 actions), 0 tests skipped, 0 scenarios skipped:
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client2-686d5f784b-5llc9 (nnn.nnn.3.88) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client-7568bc7f86-2mdgt (nnn.nnn.3.67) -> one-one-one-one-http (one.one.one.one:80)
Connectivity test failed: 2 tests failed

Anything else?

Issue https://github.com/cilium/cilium/issues/18273 reports a similar failing test, but there only one of the tests fails, not both as in my result.

I reran the connectivity test today, about two weeks after my initial run (and Slack post), with the same result. That said, the few things I have started running on my cluster all seem to be working fine.

I do have the sysdump file, but I'm reluctant to upload it until I've confirmed it doesn't contain anything sensitive. If it's really needed, let me know.

Thanks in advance!

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 26 (11 by maintainers)

Most upvoted comments

Hmm… Rich rules might be one way around this, but I haven't worked with them before…

I did find this comment in firewalld's issue queue: https://github.com/firewalld/firewalld/issues/767#issuecomment-790687269 That led me to add cilium_host, cilium_net, and cilium_vxlan to the trusted zone. After restarting everything, the connectivity tests all passed just fine. :\
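For anyone landing here later, that change can be sketched as firewall-cmd commands; the interface names are Cilium's defaults with the vxlan tunneling mode, so adjust them if your datapath differs:

```shell
# Add Cilium's interfaces to firewalld's trusted zone so host-firewall rules
# stop interfering with pod traffic. --permanent persists across reloads.
firewall-cmd --permanent --zone=trusted --add-interface=cilium_host
firewall-cmd --permanent --zone=trusted --add-interface=cilium_net
firewall-cmd --permanent --zone=trusted --add-interface=cilium_vxlan

# Apply the permanent configuration to the running firewall.
firewall-cmd --reload
```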

That makes me wonder whether I need to be concerned about the plethora of virtual NICs that Kube creates, i.e. all the lxc* interfaces listed in ifconfig's output (there's one per pod, right?). Do I need to somehow make sure they're also added to the trusted zone? I had assumed Kube would add or modify any needed rules for the virtual NICs it creates at startup, presumably after firewalld starts. But since manually adding the cilium_* interfaces to the trusted zone made a difference, I'm not sure anymore.

Guess I’d better dive into how Kube networking works a bit deeper.

Are you able to narrow down the list of rules by inspecting the drop counters via iptables-save -c ...?

Hm, that sounds like a cilium-cli issue to me rather than a Cilium issue, given that running the commands manually succeeds.

@tklauser Should we transfer this issue to the cilium-cli repo?