argo-workflows: Zombie workflows (aka "stuck workflows")

A “zombie workflow” is one that starts but does not complete. Pods are scheduled and run to completion, but the workflow is not subsequently updated.

It is as if the workflow controller never sees the pod changes.

Impacted users:

All impacted users have been running very large workflows.

Typically:

  • Zombie workflows are running 5000+ pods at once.
  • The “insignificant pod change” message is not seen in the controller logs.
  • “Deadline exceeded” is seen in the logs; increasing the CPU and memory on the Kubernetes master node may fix this. (A log-scanning sketch follows this list.)
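
To confirm which of these messages actually appear, here is a minimal sketch that streams the controller’s logs and counts them, assuming access via a local kubeconfig; the namespace (argo) and controller pod name used below are placeholders, not values from this issue.

// zombie_logcheck.go: count the two log messages mentioned above in the
// workflow-controller's logs. Namespace and pod name are assumptions.
package main

import (
	"bufio"
	"context"
	"flag"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", clientcmd.RecommendedHomeFile, "path to kubeconfig")
	namespace := flag.String("namespace", "argo", "controller namespace (placeholder)")
	pod := flag.String("pod", "workflow-controller-0", "controller pod name (placeholder)")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Stream the controller's logs and count the two messages of interest.
	stream, err := cs.CoreV1().Pods(*namespace).GetLogs(*pod, &corev1.PodLogOptions{}).Stream(context.Background())
	if err != nil {
		panic(err)
	}
	defer stream.Close()

	var insignificant, deadline int
	scanner := bufio.NewScanner(stream)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "insignificant pod change") {
			insignificant++
		}
		if strings.Contains(line, "Deadline exceeded") {
			deadline++
		}
	}
	fmt.Printf("insignificant pod change: %d lines\nDeadline exceeded: %d lines\n", insignificant, deadline)
}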

Things that don’t appear to work (rejected hypotheses):

  • Changing the --burst or --qps (client QPS) settings.
  • Changing the --workflow-workers or --pod-workers settings. These only affect concurrent processing.
  • Increasing the controller’s memory or CPU.
  • Setting ALL_POD_CHANGES_SIGNIFICANT=true. Hypothesis: we’re missing significant pod changes.
  • Setting INFORMER_WRITE_BACK=false. (A sketch showing where these flags and environment variables live on the controller follows this list.)
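
For reference only, since none of these changes resolved the problem, here is a sketch of where those flags and environment variables are set, assuming the controller runs as a Deployment named workflow-controller in the argo namespace; the names and all values below are illustrative assumptions, not recommendations.

// tune_controller.go: show where the knobs listed above live on the
// workflow-controller Deployment. Deployment/namespace names are assumptions.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	deployments := cs.AppsV1().Deployments("argo") // namespace is an assumption
	dep, err := deployments.Get(ctx, "workflow-controller", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	c := &dep.Spec.Template.Spec.Containers[0]
	// Flags from the list above (illustrative values only).
	c.Args = append(c.Args, "--qps", "30", "--burst", "60", "--workflow-workers", "64", "--pod-workers", "64")
	// Environment variables from the list above.
	c.Env = append(c.Env,
		corev1.EnvVar{Name: "ALL_POD_CHANGES_SIGNIFICANT", Value: "true"},
		corev1.EnvVar{Name: "INFORMER_WRITE_BACK", Value: "false"},
	)

	if _, err := deployments.Update(ctx, dep, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}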

Questions:

  • Every 20m, each workflow should be re-reconciled. Did waiting fix it?
  • Did restarting the controller fix it?
  • Is it that the pods didn’t start? Or that the controller doesn’t see their completion?
  • Does the zombie workflow have the `workflows.argoproj.io/completed: true` label? (An inspection sketch follows this list.)
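
To answer the last two questions for one suspect workflow, here is a sketch that reads the workflow through the dynamic client and lists its pods; the namespace, workflow name, and pod label selector below are assumptions for illustration, not values confirmed in this issue.

// inspect_zombie.go: print a suspect workflow's phase, whether the completed
// label is set, and the phases of its pods (did they start / finish?).
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	namespace := flag.String("namespace", "argo", "workflow namespace (placeholder)")
	name := flag.String("workflow", "my-zombie-wf", "workflow name (placeholder)")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Workflows are a CRD, so read them through the dynamic client.
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	wf, err := dyn.Resource(gvr).Namespace(*namespace).Get(ctx, *name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	phase, _, _ := unstructured.NestedString(wf.Object, "status", "phase")
	fmt.Printf("workflow %s: phase=%q completed-label=%q\n",
		*name, phase, wf.GetLabels()["workflows.argoproj.io/completed"])

	// Did the pods start, and did they finish? The label selector below is an
	// assumption about how the controller labels workflow pods.
	pods, err := cs.CoreV1().Pods(*namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "workflows.argoproj.io/workflow=" + *name,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s: phase=%s\n", p.Name, p.Status.Phase)
	}
}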

Users should try the following:

  • Run argoproj/workflow-controller:v2.11.7 - this is faster than v2.11.6 and all previous versions. Suitable for production.
  • Set GC settings: https://argoproj.github.io/argo/cost-optimisation/#limit-the-total-number-of-workflows-and-pods
  • Delete any old completed workflows.
  • Run argoproj/workflow-controller:latest.
  • Run argoproj/workflow-controller:mot with the environment variable MAX_OPERATION_TIME=30s. Make sure it logs defaultRequeueTime=30s maxOperationTime=30s. Hypothesis: we need more time to schedule pods. (A sketch of applying this image and environment change follows this list.)
  • Run argoproj/workflow-controller:easyjson. Hypothesis: JSON marshaling is very slow.
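
A minimal sketch of applying the image and MAX_OPERATION_TIME change from the items above, again assuming a Deployment named workflow-controller in the argo namespace (both names are assumptions).

// try_mot_build.go: switch the controller to the experimental image and set
// MAX_OPERATION_TIME, as suggested above. Deployment/namespace names are assumptions.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	deployments := cs.AppsV1().Deployments("argo") // namespace is an assumption
	dep, err := deployments.Get(ctx, "workflow-controller", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	c := &dep.Spec.Template.Spec.Containers[0]
	c.Image = "argoproj/workflow-controller:mot"
	c.Env = append(c.Env, corev1.EnvVar{Name: "MAX_OPERATION_TIME", Value: "30s"})

	if _, err := deployments.Update(ctx, dep, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	// After the rollout, the controller log should include:
	//   defaultRequeueTime=30s maxOperationTime=30s
}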

If none of this works then we need to investigate deeper.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 15 (8 by maintainers)

Most upvoted comments

@alexec:

workflow-controller: mot+4998b2d.dirty
  BuildDate: 2020-11-19T17:56:18Z
  GitCommit: 4998b2d6574adfe039b9c037251ecc717e7f1996
  GitTreeState: dirty
  GitTag: latest
  GoVersion: go1.13.15
  Compiler: gc
  Platform: linux/amd64
time="2020-11-20T21:16:29Z" level=info defaultRequeueTime=30s maxOperationTime=30s

I’ve created a POC engineering build that uses S3, instead of MySQL or Postgres, to offload and archive workflows. My hypothesis is that offloading there may be faster for users running large (5000+ node) workflows, or that archiving there may be more useful to many users. On top of this, it may be cheaper for many users. I challenge you to prove me wrong. https://github.com/argoproj/argo/pull/4582

We were able to replicate the findings here and can verify that CPU usage is way down with hundreds of concurrent workflows running.