argo-workflows: Workflow steps fail with a 'pod deleted' message.
Summary
Maybe related to #3381?
Some of the workflow steps end up in Error state, with a 'pod deleted' message. I am not sure which of the following data points are relevant, but listing all observations:
- the workflow uses PodGC: strategy: OnPodSuccess.
- we are seeing this for ~5% of workflow steps.
- affected steps are a part of a withItems loop
- the workflow is not large - ~170 to 300 concurrent nodes
- this is observed since deploying v2.12.0rc2 yesterday, including v2.12.0rc2 executor image. We were previously on v2.11.6 and briefly on v2.11.7, and have not seen this.
- k8s events confirm the pods ran to completion
- cluster scaling has been ruled out as the cause - this is observed on multiple k8s nodes, all of which are still running
- we have not tried the same workflow without PodGC yet.
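For readers trying to reproduce this, a minimal sketch of a workflow with the same shape (PodGC strategy OnPodSuccess plus a withItems loop) might look like the following. The names, the image, and the short item list are illustrative assumptions and are not taken from the affected workflow, which runs far more nodes concurrently.

```yaml
# Hypothetical minimal workflow resembling the setup described above.
# generateName, template names, image and items are assumptions for illustration.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-withitems-
spec:
  entrypoint: main
  podGC:
    strategy: OnPodSuccess   # each pod is deleted as soon as it succeeds
  templates:
    - name: main
      steps:
        - - name: process
            template: work
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            # The real workflow fans out to hundreds of items.
            withItems: [1, 2, 3, 4, 5]
    - name: work
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.12
        command: [sh, -c]
        args: ["echo processing {{inputs.parameters.item}}"]
```

Removing the podGC block from this sketch would give the "without PodGC" variant mentioned in the last bullet.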
Diagnostics
What Kubernetes provider are you using?
docker
What version of Argo Workflows are you running?
v2.12.0rc2 for all components
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 14
- Comments: 40 (22 by maintainers)
Hello, I faced the same issue with Argo v2.12.8. Garbage collection is set to “OnPodCompletion”. Affected steps are a part of a withItems loop, and it happens with a parallelism of both 100 and 50 (I haven’t tried lower values).
(36 out of 2000 steps failed with “pod deleted”)
If I change garbage collection to wait until the end of the workflow, it no longer reproduces. Do you have any suggestions on how I can prevent this from happening?
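The workaround described in this comment (deferring pod garbage collection until the workflow finishes) would presumably correspond to a spec fragment like the one below. This is a sketch only; the commenter did not post their actual manifest.

```yaml
# Assumed fragment of a Workflow spec: pods are kept until the whole
# workflow completes, instead of being deleted one by one as they finish.
spec:
  podGC:
    strategy: OnWorkflowCompletion
```

The trade-off is that completed pods stay in the cluster for the lifetime of the workflow, which may matter for workflows with thousands of steps.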
hey @alexec - thank you - I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.
This is correct. Definitely not as reproducible anymore. For what it’s worth, my team has been running jobs on the :mot build since Friday (4 days), and many very large workflows have completed perfectly, while some smaller ones experienced pod deletion. There doesn’t seem to be any systematic pattern to it - it’s very intermittent. I’ll try the new :grace build when it’s safe to do so; we’re running some really critical workloads right now so I can’t promise to do it soon, unfortunately.
I wasn’t able to reproduce this at all (still on :grace). And see none of the above log messages… Tomorrow I’m going to try reproducing this again on 2.12.0rc2. A silly thought perhaps, but could this have been caused by another workflow? The only thing that changed throughout the day (aside from my testing) was that we deleted some previously failed workflows.
Yes. Remove INFORMER_WRITE_BACK.
Can you please try running your controller with INFORMER_WRITE_BACK=false?
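For anyone wanting to try that suggestion, one way to set the variable is a patch to the controller Deployment along these lines. This is a sketch under assumptions: the Deployment and container are assumed to be named workflow-controller, as in a default install, and the namespace is whatever your installation uses.

```yaml
# Assumed strategic-merge patch for the workflow controller Deployment.
# Only INFORMER_WRITE_BACK=false comes from the thread; the Deployment and
# container names are assumptions based on a default installation.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: INFORMER_WRITE_BACK
              value: "false"   # env values must be strings, hence the quotes
```

Saved to a file, a patch like this could be applied with kubectl patch against the Deployment; deleting the env entry again corresponds to the "Remove INFORMER_WRITE_BACK" reply.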