test-infra: Jobs fail with an error "Job execution failed: Pod got deleted unexpectedly"
What happened:
Some jobs, for example ci-kubernetes-node-kubelet-serial-cpu-manager and ci-kubernetes-node-kubelet-serial-hugepages, fail with “Job execution failed: Pod got deleted unexpectedly”. build-log.txt doesn’t exist among the job artifacts.
What you expected to happen:
The job should either fail or succeed, and build-log.txt should clearly describe the test flow.
How to reproduce it (as minimally and precisely as possible):
It’s hard to reproduce as it happens quite rarely.
Please provide links to example occurrences, if any:
- https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial-cpu-manager/1401392642746486784
- https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial-hugepages/1401423093053788160
Anything else we need to know?:
The only error message I was able to spot in this podinfo.json was this:
MountVolume.SetUp failed for volume "service" : failed to sync secret cache: timed out waiting for the condition
However, I don’t see this kind of error in the second failed job.
About this issue
- State: closed
- Created 3 years ago
- Comments: 26 (21 by maintainers)
Well, FWIW, I have spotted this issue elsewhere more recently: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kind-conformance/1402563181775163392
So I’m going to raise it to the oncall team.
I don’t think so. You might find better luck filing another issue or asking in #prow on kubernetes.slack.com, though.
@BenTheElder Is there a way to have prow retrigger jobs that fail with this error?
More context: we see this error quite often when the AZ runs out of spot instances. Looking for a way to have prow rerun such jobs.
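For reference, plank has a setting for how evicted test pods are handled; as far as I recall, by default it recreates evicted pods instead of failing the ProwJob. A rough sketch of the config is below (field name and default from memory, please verify against the Prow config docs for your version; it covers evictions specifically, and a pod that gets deleted outright may still surface as this error):

```yaml
plank:
  # If false (the default, as far as I recall), plank recreates test pods
  # that were evicted by the cluster instead of erroring the ProwJob.
  # Setting it to true makes an eviction fail the job immediately.
  error_on_eviction: false
```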
Sounds promising. Let’s see if it happens again after the action. If not, I’ll close this issue.
@BenTheElder
Both tests are passing most of the time. The hugepages test failed on the 5th and 7th of June and on the 26th of May.
OK, if this is so hard to debug, let’s monitor this further. If it happens again periodically, I’ll try to contact the oncall team. Thank you for your help!
The tests run as Kubernetes pods and can be deleted by the host cluster if they’re consuming excessive resources.
If they’re deleted, the logs are lost due to prow’s structural design around running pods (the logs are uploaded from the pod output on completion, by the pod itself).
You should probably be looking at resource requests / limits, and disk usage.
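In case it helps anyone landing here, this is roughly where those requests/limits live in a periodic job definition. This is only a sketch: the job name, image, and values below are hypothetical, not taken from the real job configs, so treat them as placeholders.

```yaml
periodics:
- name: ci-example-node-serial            # hypothetical job name
  interval: 4h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest  # placeholder image
      command:
      - runner.sh
      resources:
        requests:        # what the scheduler reserves for the test pod
          cpu: "2"
          memory: 4Gi
        limits:          # exceeding the memory limit gets the pod OOM-killed/evicted
          memory: 4Gi
```

If the pod is being deleted for resource pressure on the node, bumping the requests (so it lands on a node with enough headroom) tends to matter more than the limits.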