test-infra: Prow fails if local DNS is not ready on startup

What happened:

A number of jobs are failing with an error like:

ssh: Could not resolve hostname github.com: Try again
fatal: Could not read from remote repository.

This happens in different cases, but the one I am dealing with now occurs when a pod is launched on a node that has just started and node-local DNS has not yet initialized. Normally this is not a problem for kubernetes pods, as they are built to be resilient against service failures, including DNS: they retry, or exit and get restarted.

Prow, however, does not restart these pods, nor does it retry the operation when DNS fails.

What you expected to happen:

Prow jobs should retry some number of times, either as part of the checkout logic or simply by exiting so that the pod is recreated, the way failures are normally handled in kubernetes.
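
For illustration, a minimal sketch of that kind of bounded retry (this is not Prow's actual clonerefs logic; the repo, attempt count, and backoff values are made up):

```go
// Hedged sketch only: illustrates retrying a clone with exponential backoff
// so a transient "Could not resolve hostname" failure doesn't kill the job
// outright.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// cloneWithRetry runs `git clone` up to maxAttempts times, doubling the
// wait between attempts, and returns the last error if every attempt fails.
func cloneWithRetry(repo, dest string, maxAttempts int) error {
	backoff := 2 * time.Second
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := exec.Command("git", "clone", repo, dest).CombinedOutput()
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("attempt %d/%d: %v: %s", attempt, maxAttempts, err, out)
		log.Print(lastErr)
		time.Sleep(backoff)
		backoff *= 2
	}
	return lastErr
}

func main() {
	// Repo and destination are placeholders.
	if err := cloneWithRetry("git@github.com:kubernetes/test-infra.git", "/tmp/test-infra", 5); err != nil {
		log.Fatal(err) // exit non-zero so kubernetes can recreate the pod
	}
}
```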

How to reproduce it (as minimally and precisely as possible):

It seems tricky to reproduce, but if you can run a prow job in an environment where DNS is not available, you will see it fail immediately instead of retrying.
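
One way to approximate that state, assuming you can edit the job's pod spec, is to point the pod at a resolver that never answers; the address below is a deliberately unroutable TEST-NET-1 placeholder:

```go
// Hedged sketch: simulate "node-local DNS not ready" by giving a pod a
// nameserver that never responds. dnsPolicy None tells kubernetes to use
// only the dnsConfig below instead of the node/cluster resolver settings.
package repro

import corev1 "k8s.io/api/core/v1"

func withBrokenDNS(spec *corev1.PodSpec) {
	spec.DNSPolicy = corev1.DNSNone
	spec.DNSConfig = &corev1.PodDNSConfig{
		// 192.0.2.1 (TEST-NET-1) black-holes every query, so lookups
		// time out and fail with the same "Try again" error.
		Nameservers: []string{"192.0.2.1"},
	}
}
```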

Most upvoted comments

If that’s happening before node-local DNS is ready, then we should fix the node-local DNS addon instead.

What I think I’m seeing is that node-local DNS just runs as a normal pod, and there’s no logic to prevent other pods from launching before node-local DNS is ready to serve requests. Are you aware of a mechanism that should prevent pods from being started on a node before node-local DNS is ready?

It’s a useful workaround that applies to more than just this issue; DNS calls in kubernetes can be flaky, in part because of all the extra search traffic.

Maybe it deserves its own issue, though; otherwise, if this is closed with some other workaround, your idea will be lost.

I see the dnsPolicy setting more as a workaround for the original issue. It feels like we could be a bit more robust and perhaps retry talking to GitHub.

This is fair, but it’s worth noting that for jobs that aren’t talking to in-cluster services and are just using prow as a general CI system, doing the full 5-tier search path is slow and flaky, and tuning ndots is a generally useful knob for workloads that don’t need it. Even for things that are talking within the cluster, you don’t need ndots to be so high if you just use the full service path, including the namespace.
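
As a sketch of that knob (the value is illustrative, and nothing in this thread confirms Prow sets it today), the pod-spec side could look like:

```go
// Hedged sketch: pin ndots to 1 via dnsConfig so external names like
// github.com are resolved directly instead of first being tried against
// each suffix in the resolv.conf search list.
package dnstuning

import corev1 "k8s.io/api/core/v1"

func withLowNdots(spec *corev1.PodSpec) {
	ndots := "1" // names with at least one dot are resolved as-is
	spec.DNSConfig = &corev1.PodDNSConfig{
		Options: []corev1.PodDNSConfigOption{
			{Name: "ndots", Value: &ndots},
		},
	}
}
```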

We’ve had to employ the FQDN workaround (if you call foo.bar.baz.cluster.local. with the trailing dot, there’s no search), which is harder to propagate to all network calls. (kubernetes/test-infra currently does this for boskos.)
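
For illustration (the service and namespace names below are made up), the difference the trailing dot makes:

```go
// Illustrative only: a trailing dot marks a name as fully qualified, so
// the resolver issues exactly one query instead of expanding the name
// against every suffix in the search list.
package fqdn

const (
	searchedName  = "boskos"                              // expanded against each resolv.conf search suffix
	qualifiedName = "boskos.test-pods.svc.cluster.local." // trailing dot: resolved as-is, no search
)
```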
