argo-workflows: Workflow steps fail with a 'pod deleted' message.

Summary

Maybe related to #3381?

Some of the workflow steps end up in the Error state with the message "pod deleted". I am not sure which of the following data points are relevant, so I am listing all observations:

the workflow uses PodGC: strategy: OnPodSuccess.
we are seeing this for ~5% of workflow steps.
affected steps are a part of a withItems loop
the workflow is not large - ~170 to 300 concurrent nodes
this has been observed since deploying v2.12.0rc2 yesterday, including the v2.12.0rc2 executor image. We were previously on v2.11.6 and briefly on v2.11.7, and did not see this on either.
k8s events confirm the pods ran to completion
cluster scaling has been ruled out as the cause - this is observed on multiple k8s nodes, all of which are still running
we have not tried the same workflow without PodGC yet.
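
To put a number on the "~5% of workflow steps" observation, the affected nodes can be counted from the workflow's status. The following is a minimal sketch, assuming the workflow object has been exported as JSON (for example with `kubectl get workflow <name> -o json`) and that affected nodes appear under `status.nodes` with phase `Error` and a message containing `pod deleted`, as described above. The function name and the sample data are hypothetical and only illustrate the shape of the check.

```python
def pod_deleted_rate(workflow):
    """Return (affected, total) counted over Pod-type nodes in status.nodes."""
    nodes = (workflow.get("status") or {}).get("nodes") or {}
    # Only Pod nodes correspond to real pods; step groups and loops are skipped.
    pods = [n for n in nodes.values() if n.get("type") == "Pod"]
    affected = [
        n for n in pods
        if n.get("phase") == "Error" and "pod deleted" in (n.get("message") or "")
    ]
    return len(affected), len(pods)


# Hypothetical hand-written sample; a real workflow would be read with json.load().
sample = {
    "status": {
        "nodes": {
            "wf-1": {"type": "Steps", "phase": "Running"},
            "wf-1-a": {"type": "Pod", "phase": "Succeeded"},
            "wf-1-b": {"type": "Pod", "phase": "Error", "message": "pod deleted"},
            "wf-1-c": {"type": "Pod", "phase": "Failed", "message": "exit code 1"},
        }
    }
}

affected, total = pod_deleted_rate(sample)
print(f"{affected} of {total} pod nodes ended in Error with 'pod deleted'")
```

Counting only `Pod` nodes keeps the denominator comparable to the number of steps that actually ran a container.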

Diagnostics

What Kubernetes provider are you using?

docker

What version of Argo Workflows are you running?

v2.12.0rc2 for all components

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

State: closed
Created 4 years ago
Reactions: 14
Comments: 40 (22 by maintainers)

Most upvoted comments

Hello, I faced the same issue with Argo v2.12.8. Garbage collection is set to “OnPodCompletion”. Affected steps are part of a withItems loop, and it happens with a parallelism of both 100 and 50 (I haven’t tried lower).

(36 out of 2000 steps failed with “pod deleted”)

If I change garbage collection to wait until the end of the workflow, it no longer reproduces. Do you have any suggestions on how I can prevent this from happening?

dynamicmindset on Feb 12, 2021
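
The workaround described in the comment above (deferring pod garbage collection until the workflow ends) corresponds to changing `spec.podGC.strategy` in the Workflow manifest. Below is a minimal sketch, assuming the manifest is handled as a Python dict before submission. The helper name `with_pod_gc` and the manifest fragment are hypothetical, and the fragment omits the templates a real workflow would need.

```python
import copy
import json

# Values accepted by spec.podGC.strategy in Argo Workflows.
POD_GC_STRATEGIES = {
    "OnPodCompletion",
    "OnPodSuccess",
    "OnWorkflowCompletion",
    "OnWorkflowSuccess",
}


def with_pod_gc(workflow, strategy):
    """Return a copy of the manifest with spec.podGC.strategy set to `strategy`."""
    if strategy not in POD_GC_STRATEGIES:
        raise ValueError(f"unknown podGC strategy: {strategy!r}")
    updated = copy.deepcopy(workflow)  # leave the caller's manifest untouched
    updated.setdefault("spec", {})["podGC"] = {"strategy": strategy}
    return updated


# Hypothetical manifest fragment (templates omitted for brevity).
manifest = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "example-"},
    "spec": {"entrypoint": "main", "podGC": {"strategy": "OnPodCompletion"}},
}

patched = with_pod_gc(manifest, "OnWorkflowCompletion")
print(json.dumps(patched["spec"]["podGC"]))
```

This only sidesteps the symptom, as the comment notes; pods then remain in the cluster until the workflow finishes.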

hey @alexec - thank you - I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.

Not fixed - but less of a problem?

This is correct. Definitely not as reproducible anymore. For what it’s worth, my team has been running jobs on the :mot build since Friday (4 days), and many very large workflows have completed perfectly, while some smaller ones experienced pod deletion. There doesn’t seem to be any systematic pattern to it - it’s very intermittent. I’ll try the new :grace build when it’s safe to do so; we’re running some really critical workloads right now so I can’t promise to do it soon, unfortunately.

ebr on Nov 24, 2020

I wasn’t able to reproduce this at all (still on :grace), and I see none of the above log messages… Tomorrow I’m going to try reproducing this again on 2.12.0rc2. A silly thought perhaps, but could this have been caused by another workflow? The only thing that changed throughout the day (aside from my testing) was that we deleted some previously failed workflows.

ebr on Nov 20, 2020

Yes. Remove INFORMER_WRITE_BACK.

alexec on Nov 19, 2020

Can you please try running your controller with INFORMER_WRITE_BACK=false?

alexec on Nov 19, 2020
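
For anyone trying the `INFORMER_WRITE_BACK=false` suggestion, the variable has to be set in the environment of the controller container. The sketch below builds a strategic merge patch for that purpose. It assumes the controller runs as a Deployment whose container is named `workflow-controller`; that name, and the helper `controller_env_patch`, are assumptions and should be checked against your own installation.

```python
import json


def controller_env_patch(name, value, container="workflow-controller"):
    """Build a strategic merge patch that sets one env var on a Deployment container.

    The container name is an assumption; it must match the name in your Deployment.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": container, "env": [{"name": name, "value": value}]}
                    ]
                }
            }
        }
    }


patch = controller_env_patch("INFORMER_WRITE_BACK", "false")
print(json.dumps(patch))
```

The printed JSON could then be applied with something like `kubectl -n argo patch deployment workflow-controller --patch '<json>'`, where the `argo` namespace and the Deployment name are again assumptions about a default install.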