cilium-cli: flake: timeout reached waiting for service (echo-same-node or echo-other-node)
flake instances
- https://github.com/cilium/cilium-cli/pull/331/checks?check_run_id=2861168613
- https://github.com/cilium/cilium-cli/runs/2842079516?check_suite_focus=true#341
- https://github.com/cilium/cilium-cli/pull/336/checks?check_run_id=2860556736
symptoms
`cilium connectivity test` times out waiting for the `echo-same-node` or `echo-other-node` service to become ready.
⌛ [cilium-cilium-cli-950610632] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for deployments [echo-other-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-blm9q to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-6zhqj to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-2nnln to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-rt6hg to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for Service cilium-test/echo-other-node to become ready...
Error: Connectivity test failed: timeout reached waiting for ***&Service***ObjectMeta:***echo-other-node cilium-test /api/v1/namespaces/cilium-test/services/echo-other-node fd2c3c95-c8b4-48be-b159-1bdfb2041050 1725 0 2021-06-18 17:38:45 +0000 UTC <nil> <nil> map[kind:echo] map[] [] [] [***cilium Update v1 2021-06-18 17:38:45 +0000 UTC FieldsV1 ***"f:metadata":***"f:labels":***".":***,"f:kind":***,"f:spec":***"f:externalTrafficPolicy":***,"f:ports":***".":***,"k:***\"port\":8080,\"protocol\":\"TCP\"***":***".":***,"f:name":***,"f:port":***,"f:protocol":***,"f:targetPort":***,"f:selector":***".":***,"f:name":***,"f:sessionAffinity":***,"f:type":***]***,Spec:ServiceSpec***Ports:[]ServicePort***ServicePort***Name:echo-other-node,Protocol:TCP,Port:8080,TargetPort:***0 8080 ***,NodePort:30597,AppProtocol:nil,***,***,Selector:map[string]string***name: echo-other-node,***,ClusterIP:10.0.171.109,Type:NodePort,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Cluster,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,TopologyKeys:[],***,Status:ServiceStatus***LoadBalancer:LoadBalancerStatus***Ingress:[]LoadBalancerIngress***,***,***,***
About this issue
- State: open
- Created 3 years ago
- Comments: 16 (14 by maintainers)
Commits related to this issue
- check: Add extra info Print the number of times the validation command got executed before timeout was reached. I just want to confirm that the command is not getting stuck for a long time. Ref: #34... — committed to cilium/cilium-cli by michi-covalent 3 years ago
- workflows: disable AKS testing with encryption enabled https://github.com/cilium/cilium-cli/issues/342 is hit almost consistently on AKS when running the second `cilium connectivity test` with encryp... — committed to cilium/cilium by nbusseneau 3 years ago
- workflows: disable AKS testing with encryption enabled https://github.com/cilium/cilium-cli/issues/342 is hit almost consistently on AKS when running the second `cilium connectivity test` with encryp... — committed to f5devcentral/cilium by nbusseneau 3 years ago
- connectivity/check: Fix wrong NodePort service selection on validation It is possible for the tuples of node IP and port to be mismatched in the case of NodePort services, causing the connectivity te... — committed to cilium/cilium-cli by christarazi 2 years ago
I tried to mention that during the last community meeting, but my mic quality was bad: Given that we don’t even see the DNS requests hitting the target node, I think the CoreDNS hypothesis does not apply to this flake here.
I think the discussion related to restarting CoreDNS only applies to https://github.com/cilium/cilium/issues/17401 - there, DNS requests are hitting the target CoreDNS pod, but CoreDNS does not know about the service yet.
I think these are two separate issues. Even though the symptoms (K8s service not found) are very similar, we should not mix them up:
- Symptom here (#342): DNS lookup for service fails due to timeout (no answer)
- Symptom in https://github.com/cilium/cilium/issues/17401: DNS lookup for service fails with NXDOMAIN
Ah! Yes, you are correct. I was hitting and debugging this issue in `cilium/cilium`, which does not have debug logs, but we do have some traces from `cilium/cilium-cli` with the error message. In that case no action needed regarding that point, we already do have the stderr (from https://github.com/cilium/cilium-cli/issues/342#issuecomment-919978939):
Not sure where that `^C` comes from 🤔
This flake was happening a lot on AKS, but AKS testing has been disabled due to probable changes on AKS’ side and nobody had time to look at it yet 😬