test-infra: Jobs fail with an error "Job execution failed: Pod got deleted unexpectedly"
What happened:
Some jobs, for example ci-kubernetes-node-kubelet-serial-cpu-manager and ci-kubernetes-node-kubelet-serial-hugepages, fail with “Job execution failed: Pod got deleted unexpectedly”. build-log.txt doesn’t exist among the job artifacts.
What you expected to happen:
The job should either fail or succeed, and build-log.txt should clearly describe the test flow.
How to reproduce it (as minimally and precisely as possible):
It’s hard to reproduce as it happens quite rarely.
Please provide links to example occurrences, if any:
- https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial-cpu-manager/1401392642746486784
- https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial-hugepages/1401423093053788160
Anything else we need to know?:
The only error message I was able to spot in this podinfo.json was this:
MountVolume.SetUp failed for volume "service" : failed to sync secret cache: timed out waiting for the condition
However, I don’t see this kind of error in the second failed job.
About this issue
- State: closed
- Created 3 years ago
- Comments: 26 (21 by maintainers)
Well, FWIW, I have spotted this issue elsewhere more recently: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kind-conformance/1402563181775163392
So I’m going to raise it to the oncall team.
I don’t think so. You might find better luck filing another issue or asking in #prow on kubernetes.slack.com, though.
@BenTheElder Is there a way to have prow retrigger jobs that fail with this error?
More context: we see this error quite often when the AZ runs out of spot instances. Looking for a way to have prow rerun such jobs.
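For reference, plank has a setting for how evicted test pods are handled; as far as I recall, by default it recreates evicted pods instead of failing the ProwJob. A rough sketch of the config is below (field name and default from memory, please verify against the Prow config docs for your version; it covers evictions specifically, and a pod that gets deleted outright may still surface as this error):

```yaml
plank:
  # If false (the default, as far as I recall), plank recreates test pods
  # that were evicted by the cluster instead of erroring the ProwJob.
  # Setting it to true makes an eviction fail the job immediately.
  error_on_eviction: false
```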
Sounds promising. Let’s see if it happens again after the action. If not, I’ll close this issue.
@BenTheElder
Both tests are passing most of the time. The hugepages test failed on the 5th and 7th of June and on the 26th of May.
OK, if this is so hard to debug, let’s monitor this further. If it happens again periodically, I’ll try to contact the oncall team. Thank you for your help!
The tests run as Kubernetes pods and can be deleted by the host cluster if they’re consuming excessive resources.
If they’re deleted, the logs are lost due to prow’s structural design around running pods (the logs are uploaded from the pod output on completion, by the pod itself).
You should probably be looking at resource requests / limits, and disk usage.
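In case it helps anyone landing here, this is roughly where those requests/limits live in a periodic job definition. This is only a sketch: the job name, image, and values below are hypothetical, not taken from the real job configs, so treat them as placeholders.

```yaml
periodics:
- name: ci-example-node-serial            # hypothetical job name
  interval: 4h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest  # placeholder image
      command:
      - runner.sh
      resources:
        requests:        # what the scheduler reserves for the test pod
          cpu: "2"
          memory: 4Gi
        limits:          # exceeding the memory limit gets the pod OOM-killed/evicted
          memory: 4Gi
```

If the pod is being deleted for resource pressure on the node, bumping the requests (so it lands on a node with enough headroom) tends to matter more than the limits.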