cilium: CI: ConformanceAKS: curl succeeded while it should have failed due to incorrect exit code

Several curl tests are failing because the commands succeed when we expect them to fail. For example:

[=] Test [all-ingress-deny]
..
  ℹ️  📜 Applying CiliumNetworkPolicy 'all-ingress-deny' to namespace 'cilium-test'..
  [-] Scenario [all-ingress-deny/pod-to-pod]
  [.] Action [all-ingress-deny/pod-to-pod/curl-0: cilium-test/client2-bb57d9bb4-b5pbz (10.0.1.204) -> cilium-test/echo-other-node-694dd79b46-qzslj (10.0.0.220:8080)]
  [.] Action [all-ingress-deny/pod-to-pod/curl-1: cilium-test/client2-bb57d9bb4-b5pbz (10.0.1.204) -> cilium-test/echo-same-node-865dcc9b58-nz2tf (10.0.1.86:8080)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.1.86:8080/" succeeded while it should have failed: 
  ℹ️  curl output:

Examples:

  • https://github.com/cilium/cilium/actions/runs/3461307458/jobs/5779401491
  • https://github.com/cilium/cilium/actions/runs/3457550315/jobs/5771080426
  • https://github.com/cilium/cilium/actions/runs/3456343637/jobs/5769011681

Sysdumps for the first two examples above: cilium-sysdump-out.zip.zip and cilium-sysdump-out.zip(1).zip
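For reference, the message above comes from an expectation check: the deny policy should drop the traffic, so the exec’ed curl is expected to return an error. A minimal sketch of that shape (illustrative only, not the actual cilium-cli code):

package main

import "fmt"

// checkResult is an illustrative helper, not the cilium-cli implementation:
// it compares the outcome of an exec'ed command against the expectation
// derived from the applied network policy.
func checkResult(cmd string, execErr error, expectFailure bool) error {
	if expectFailure && execErr == nil {
		return fmt.Errorf("command %q succeeded while it should have failed", cmd)
	}
	if !expectFailure && execErr != nil {
		return fmt.Errorf("command %q failed while it should have succeeded: %w", cmd, execErr)
	}
	return nil
}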

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (29 by maintainers)


Most upvoted comments

https://github.com/cilium/cilium/actions/runs/3638728566/jobs/6142126513 has an example of this happening with ping instead of curl. So definitely not a curl issue.

Just pushed a change to the CLI to capture sysdumps right after a test fails: https://github.com/cilium/cilium-cli/pull/1228, hopefully that will give us some better visibility into what’s going on 🤞

No reports since my fix and Marco’s followup. Let’s close.

Ah, my “fix” doesn’t work for ping because there is some output at the beginning even when it hangs. Maybe we can change the ping command we use to remove that first line:

PING 10.0.0.83 (10.0.0.83) 56(84) bytes of data.
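A hypothetical way to do that (the helper and flags below are illustrative, not the current cilium-cli command) is to wrap ping in a shell pipeline that drops the banner, so a hung ping produces no output at all:

package main

import "fmt"

// pingCommand pipes ping through `tail -n +2`, dropping the
// "PING x.x.x.x (...) bytes of data." banner so that output only appears
// once an echo reply is actually received. Flags are illustrative.
func pingCommand(target string) []string {
	return []string{"sh", "-c", fmt.Sprintf("ping -c 1 -W 2 %s | tail -n +2", target)}
}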

I managed to get the kubelet logs for https://github.com/cilium/cilium/actions/runs/3650375007/jobs/6181499675

[=] Test [echo-ingress-l7]
..
  ℹ️  📜 Applying CiliumNetworkPolicy 'echo-ingress-l7-http' to namespace 'cilium-test'..
  [-] Scenario [echo-ingress-l7/pod-to-pod-with-endpoints]
  [.] Action [echo-ingress-l7/pod-to-pod-with-endpoints/curl-0-public: cilium-test/client-769787f6f8-jmvds (10.0.1.144) -> curl-0-public (10.0.0.155:8080)]
  [.] Action [echo-ingress-l7/pod-to-pod-with-endpoints/curl-0-private: cilium-test/client-769787f6f8-jmvds (10.0.1.144) -> curl-0-private (10.0.0.155:8080)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.155:8080/private" succeeded while it should have failed: 
  ℹ️  curl output:
  
  
  📄 No flows recorded during action curl-0-private
  📄 No flows recorded during action curl-0-private

kubelet-1.log kubelet-2.log

but I don’t see anything interesting besides

kubelet-1.log:13:28:03.901767    3962 log.go:184] http: TLS handshake error from 10.224.0.5:47404: EOF

(which just matches the timestamp of the failure but might be a red herring)

It’s as if the curl command had been interrupted before we reached the timeout (despite what the timestamps suggest).

There’s been a similar report about the exact same logic we have in k8s/exec.go, where the suggestion was to use stdin: false.

In our case we can’t simply do that, since we still need a way to interrupt the execution of the command after it times out, so I’m giving this new StreamWithContext method a try (it needs to be backported to our own client-go fork).
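Roughly what that looks like, as a minimal sketch assuming a client-go version that ships StreamWithContext (function name, parameters, and the timeout are illustrative, not the actual k8s/exec.go code):

package main

import (
	"bytes"
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs a command in a pod and cancels the stream once the (assumed)
// timeout elapses, so a hung curl/ping no longer blocks the test forever.
func execInPod(client kubernetes.Interface, config *rest.Config, namespace, pod, container string, command []string) (bytes.Buffer, bytes.Buffer, error) {
	var stdout, stderr bytes.Buffer

	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Name(pod).Namespace(namespace).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   command,
			Stdin:     false, // matches the stdin: false suggestion from the similar report
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return stdout, stderr, err
	}

	// StreamWithContext (client-go >= v0.26) lets us interrupt a command that
	// never returns, instead of Stream() blocking indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) // illustrative timeout
	defer cancel()

	err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	return stdout, stderr, err
}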