cilium-cli: flake: timeout reached waiting for service (echo-same-node or echo-other-node)

flake instances

symptoms

cilium connectivity test times out waiting for echo-same-node or echo-other-node service.

⌛ [cilium-cilium-cli-950610632] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for deployments [echo-other-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-blm9q to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-6zhqj to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-2nnln to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-rt6hg to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for Service cilium-test/echo-other-node to become ready...

Error: Connectivity test failed: timeout reached waiting for ***&Service***ObjectMeta:***echo-other-node  cilium-test /api/v1/namespaces/cilium-test/services/echo-other-node fd2c3c95-c8b4-48be-b159-1bdfb2041050 1725 0 2021-06-18 17:38:45 +0000 UTC <nil> <nil> map[kind:echo] map[] [] []  [***cilium Update v1 2021-06-18 17:38:45 +0000 UTC FieldsV1 ***"f:metadata":***"f:labels":***".":***,"f:kind":***,"f:spec":***"f:externalTrafficPolicy":***,"f:ports":***".":***,"k:***\"port\":8080,\"protocol\":\"TCP\"***":***".":***,"f:name":***,"f:port":***,"f:protocol":***,"f:targetPort":***,"f:selector":***".":***,"f:name":***,"f:sessionAffinity":***,"f:type":***]***,Spec:ServiceSpec***Ports:[]ServicePort***ServicePort***Name:echo-other-node,Protocol:TCP,Port:8080,TargetPort:***0 8080 ***,NodePort:30597,AppProtocol:nil,***,***,Selector:map[string]string***name: echo-other-node,***,ClusterIP:10.0.171.109,Type:NodePort,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Cluster,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,TopologyKeys:[],***,Status:ServiceStatus***LoadBalancer:LoadBalancerStatus***Ingress:[]LoadBalancerIngress***,***,***,***

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

I tried to mention that during the last community meeting, but my mic quality was bad: Given that we don’t even see the DNS requests hitting the target node, I think the CoreDNS hypothesis does not apply to this flake here.

I think the discussion about restarting CoreDNS applies only to https://github.com/cilium/cilium/issues/17401. There, DNS requests are hitting the target CoreDNS pod, but CoreDNS does not know about the service yet.

I think these are two separate issues. Even though the symptoms (K8s service not found) are very similar, we should not mix them up:

  • Symptom here (#342): DNS lookup for the service fails due to a timeout (no answer).
  • Symptom in https://github.com/cilium/cilium/issues/17401: DNS lookup for the service fails with NXDOMAIN.

I think we should already be running connectivity tests with --debug, see #409. We might have missed some jobs, though 🤔

Ah! Yes, you are correct. I was hitting and debugging this issue in cilium/cilium, which does not have debug logs, but we do have some traces from cilium/cilium-cli with the error message. In that case, no action is needed on that point; we already have the stderr (from https://github.com/cilium/cilium-cli/issues/342#issuecomment-919978939):

2021-09-14T14:55:43.345571900Z 🐛 Error waiting for service cilium-test/echo-other-node: command terminated with exit code 1: ^C

Not sure where that ^C comes from 🤔

This flake was happening a lot on AKS, but AKS testing has been disabled due to probable changes on AKS’ side and nobody had time to look at it yet 😬