argo-workflows: podGC: onPodCompletion not working properly in v3.4.9

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Workflow-controller garbage collection doesn’t work properly.

After we updated our Argo-workflows from 3.3.8 to 3.4.9 there was issue with the argo controller failing to remove complete pods. We are using podGC: OnPodCompletion strategy for all workflows in our controller-configmap. Most of the time the pod was removed properly, but maybe 1-5% of the workflows it didn’t remove the finished pod properly and the pod just stayed in the cluster as completed.

When looking at the workflow-controller logs there seems to be some issues with the garbage collection, it seems to keep putting same workflow to the garbage collection queue multiple times, even tho it should only be put there once. Now the logs are constantly spamming lines like these, while previously with 3.3.8 this wasn’t the case.

I don’t get why this workflow is put into the queue multiple times? Once should be enough, and looking at previous version logs that’s what was the case.

Here’s the number of log lines from the controller from our DataDog (ignore the color changes). The number of log lines was stable low in the previous version, but when we updated to 3.4.9 the number of log lines went up. At the end of the image we decided to go back to 3.3.8 and the number of log lines are back to previous level.

Screenshot from 2023-08-16 14-46-50

Version

v3.4.9

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflows that uses private images.

-

Logs from the workflow controller

$ kubectl logs workflow-controller-7f8bcdc656-9gvg4 | grep "argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt"
time="2023-08-16T09:14:55.884Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h17m37s due to TTL"
time="2023-08-16T09:15:00.118Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h17m32s due to TTL"
time="2023-08-16T09:22:42.304Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h9m50s due to TTL"
time="2023-08-16T09:32:28.796Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h0m4s due to TTL"
time="2023-08-16T09:54:55.019Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h37m37s due to TTL"
time="2023-08-16T09:57:37.509Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h34m55s due to TTL"
time="2023-08-16T10:02:42.185Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h29m50s due to TTL"
time="2023-08-16T10:10:17.360Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h22m15s due to TTL"
time="2023-08-16T10:14:55.166Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h17m37s due to TTL"
time="2023-08-16T10:19:45.878Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h12m47s due to TTL"
time="2023-08-16T10:25:22.294Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h7m10s due to TTL"
time="2023-08-16T10:33:06.901Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h59m26s due to TTL"
time="2023-08-16T10:34:56.043Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h57m36s due to TTL"
time="2023-08-16T10:41:50.563Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h50m42s due to TTL"
time="2023-08-16T10:46:52.196Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h45m40s due to TTL"
time="2023-08-16T10:54:55.429Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h37m37s due to TTL"
time="2023-08-16T10:55:52.254Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h36m40s due to TTL"
time="2023-08-16T11:02:24.189Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h30m8s due to TTL"

Logs from in your workflow’s wait container

-

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Reactions: 9
  • Comments: 16 (4 by maintainers)

Most upvoted comments

We see a similar issue after upgrading from 3.4.4 to 3.4.9. In our case, workflows occasionally complete successfully but somehow enter error/failure state afterwards and the exit handler gets triggered again and sends a failure notification.

Here is our workflow (simplified):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-workflow-
spec:
  entrypoint: main
  onExit: exit-handler
  podGC:
    strategy: OnWorkflowCompletion
  templates:
    - name: main
      dag: ...
    - name: exit-handler
      steps:
        - - name: notify-failed
            templateRef:
              name: slack
              template: notify-failed
            when: "{{workflow.status}} != Succeeded"

From the workflow controller logs below you can see that the workflow completes at 2023-09-07T06:08:00.021Z and is marked as pending for archiving. The exit handler was skipped because Succeeded != Succeeded evaluated false. At 2023-09-07T06:08:12.946Z, the task-result reconciliation runs, the workflow pod is missing and the main step fails. The exit handler is then triggered again at 2023-09-07T06:08:12.959Z and fires the failure notification. Since the workflow already completed, the workflow update fails around 2023-09-07T06:08:23.574Z with “must never update completed workflows”. The exit handler is then triggered again but doesn’t send another failure notification.

time="2023-09-07T06:07:21.086Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.196Z" level=info msg="adding artifact GC finalizer" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.196Z" level=info msg="Updated phase  -> Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.196Z" level=info msg="DAG node example-workflow-hq9pn initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.196Z" level=info msg="All of node example-workflow-hq9pn.main dependencies [] completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.201Z" level=info msg="Steps node example-workflow-hq9pn-663554177 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.201Z" level=info msg="StepGroup node example-workflow-hq9pn-3224806949 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:21.204Z" level=info msg="Pod node example-workflow-hq9pn-437922017 initialized Pending" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:22.041Z" level=info msg="Created pod: example-workflow-hq9pn.main[0].main-worker (example-workflow-hq9pn-sisyphus-project-template-fan-out-437922017)" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:22.041Z" level=info msg="Workflow step group node example-workflow-hq9pn-3224806949 not yet completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:22.041Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:22.041Z" level=info msg=reconcileAgentPod namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:22.145Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=1330498802 workflow=example-workflow-hq9pn
time="2023-09-07T06:07:32.047Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:32.050Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=0 workflow=example-workflow-hq9pn
time="2023-09-07T06:07:32.050Z" level=info msg="node changed" namespace=workflows new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=example-workflow-hq9pn-437922017 old.message= old.phase=Pending old.progress=0/1 workflow=example-workflow-hq9pn
time="2023-09-07T06:07:32.060Z" level=info msg="Workflow step group node example-workflow-hq9pn-3224806949 not yet completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:32.060Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:32.060Z" level=info msg=reconcileAgentPod namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:33.471Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=1330499198 workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.173Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.177Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=0 workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.177Z" level=info msg="node changed" namespace=workflows new.message= new.phase=Running new.progress=0/1 nodeID=example-workflow-hq9pn-437922017 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.191Z" level=info msg="Workflow step group node example-workflow-hq9pn-3224806949 not yet completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.191Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.191Z" level=info msg=reconcileAgentPod namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:07:42.315Z" level=info msg="Workflow update successful" namespace=workflows phase=Running resourceVersion=1330499543 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.004Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.008Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=1 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.008Z" level=info msg="task-result changed" namespace=workflows nodeID=example-workflow-hq9pn-437922017 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.008Z" level=info msg="node changed" namespace=workflows new.message= new.phase=Succeeded new.progress=0/1 nodeID=example-workflow-hq9pn-437922017 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.017Z" level=info msg="Step group node example-workflow-hq9pn-3224806949 successful" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.017Z" level=info msg="node example-workflow-hq9pn-3224806949 phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.017Z" level=info msg="node example-workflow-hq9pn-3224806949 finished: 2023-09-07 06:08:00.017528779 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.017Z" level=info msg="StepGroup node example-workflow-hq9pn-2084075952 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="SG Outbound nodes of example-workflow-hq9pn-437922017 are [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="Skipping example-workflow-hq9pn.main[1].parallel-workers: Skipped, empty params" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="Skipped node example-workflow-hq9pn-2686192655 initialized Skipped (message: Skipped, empty params)" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="Step group node example-workflow-hq9pn-2084075952 successful" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="node example-workflow-hq9pn-2084075952 phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="node example-workflow-hq9pn-2084075952 finished: 2023-09-07 06:08:00.020978629 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.020Z" level=info msg="Outbound nodes of example-workflow-hq9pn-2686192655 is [example-workflow-hq9pn-2686192655]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Outbound nodes of example-workflow-hq9pn-663554177 is [example-workflow-hq9pn-2686192655]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn-663554177 phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn-663554177 finished: 2023-09-07 06:08:00.02105617 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Checking daemoned children of example-workflow-hq9pn-663554177" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Outbound nodes of example-workflow-hq9pn set to [example-workflow-hq9pn-2686192655]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn finished: 2023-09-07 06:08:00.021116821 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Checking daemoned children of example-workflow-hq9pn" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg=reconcileAgentPod namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Running OnExit handler: exit-handler" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Steps node example-workflow-hq9pn-3648729681 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="StepGroup node example-workflow-hq9pn-1754934389 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Skipping example-workflow-hq9pn.onExit[0].notify-failed: when 'Succeeded != Succeeded' evaluated false" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Skipped node example-workflow-hq9pn-180906186 initialized Skipped (message: when 'Succeeded != Succeeded' evaluated false)" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Step group node example-workflow-hq9pn-1754934389 successful" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn-1754934389 phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn-1754934389 finished: 2023-09-07 06:08:00.021359114 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Outbound nodes of example-workflow-hq9pn-180906186 is [example-workflow-hq9pn-180906186]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Outbound nodes of example-workflow-hq9pn-3648729681 is [example-workflow-hq9pn-180906186]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn-3648729681 phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="node example-workflow-hq9pn-3648729681 finished: 2023-09-07 06:08:00.021395154 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Checking daemoned children of example-workflow-hq9pn-3648729681" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Updated phase Running -> Succeeded" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Marking workflow completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Marking workflow as pending archiving" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.021Z" level=info msg="Checking daemoned children of " namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.026Z" level=info msg="cleaning up pod" action=deletePod key=workflows/example-workflow-hq9pn-1340600742-agent/deletePod
time="2023-09-07T06:08:00.050Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"example-workflow-hq9pn\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.050Z" level=info msg="Re-applying updates on latest version and retrying update" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.189Z" level=info msg="Update retry attempt 1 successful" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:00.189Z" level=info msg="Workflow update successful" namespace=workflows phase=Succeeded resourceVersion=1330499987 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:05.272Z" level=info msg="cleaning up pod" action=deletePod key=workflows/example-workflow-hq9pn-sisyphus-project-template-fan-out-437922017/deletePod
time="2023-09-07T06:08:12.943Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.946Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=0 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.946Z" level=info msg="Workflow pod is missing" namespace=workflows nodeName="example-workflow-hq9pn.main[0].main-worker" nodePhase=Running recentlyStarted=false workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.946Z" level=info msg="node example-workflow-hq9pn-437922017 phase Running -> Error" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.946Z" level=info msg="node example-workflow-hq9pn-437922017 message: pod deleted" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.946Z" level=info msg="node example-workflow-hq9pn-437922017 finished: 2023-09-07 06:08:12.946896888 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.953Z" level=info msg="Step group node example-workflow-hq9pn-3224806949 deemed failed: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.953Z" level=info msg="node example-workflow-hq9pn-3224806949 phase Running -> Failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.953Z" level=info msg="node example-workflow-hq9pn-3224806949 message: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.953Z" level=info msg="node example-workflow-hq9pn-3224806949 finished: 2023-09-07 06:08:12.953441194 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.953Z" level=info msg="step group example-workflow-hq9pn-3224806949 was unsuccessful: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.956Z" level=info msg="Outbound nodes of example-workflow-hq9pn-437922017 is [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.956Z" level=info msg="Outbound nodes of example-workflow-hq9pn-663554177 is [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.956Z" level=info msg="node example-workflow-hq9pn-663554177 phase Running -> Failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.956Z" level=info msg="node example-workflow-hq9pn-663554177 message: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.956Z" level=info msg="node example-workflow-hq9pn-663554177 finished: 2023-09-07 06:08:12.956142645 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.956Z" level=info msg="Checking daemoned children of example-workflow-hq9pn-663554177" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.958Z" level=info msg="Outbound nodes of example-workflow-hq9pn set to [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.958Z" level=info msg="node example-workflow-hq9pn phase Running -> Failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.958Z" level=info msg="node example-workflow-hq9pn finished: 2023-09-07 06:08:12.958950947 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.958Z" level=info msg="Checking daemoned children of example-workflow-hq9pn" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.958Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.959Z" level=info msg=reconcileAgentPod namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.959Z" level=info msg="Running OnExit handler: exit-handler" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.959Z" level=info msg="Steps node example-workflow-hq9pn-3648729681 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.959Z" level=info msg="StepGroup node example-workflow-hq9pn-1754934389 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:12.962Z" level=info msg="Pod node example-workflow-hq9pn-180906186 initialized Pending" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:13.450Z" level=info msg="Created pod: example-workflow-hq9pn.onExit[0].notify-failed (example-workflow-hq9pn-notify-failed-180906186)" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:13.450Z" level=info msg="Workflow step group node example-workflow-hq9pn-1754934389 not yet completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:13.659Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"example-workflow-hq9pn\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:13.659Z" level=info msg="Re-applying updates on latest version and retrying update" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:13.723Z" level=info msg="Failed to re-apply update" error="must never update completed workflows" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.456Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.459Z" level=info msg="Task-result reconciliation" namespace=workflows numObjs=0 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.459Z" level=info msg="Workflow pod is missing" namespace=workflows nodeName="example-workflow-hq9pn.main[0].main-worker" nodePhase=Running recentlyStarted=false workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.459Z" level=info msg="node example-workflow-hq9pn-437922017 phase Running -> Error" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.459Z" level=info msg="node example-workflow-hq9pn-437922017 message: pod deleted" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.459Z" level=info msg="node example-workflow-hq9pn-437922017 finished: 2023-09-07 06:08:23.459449342 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.465Z" level=info msg="Step group node example-workflow-hq9pn-3224806949 deemed failed: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.465Z" level=info msg="node example-workflow-hq9pn-3224806949 phase Running -> Failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.465Z" level=info msg="node example-workflow-hq9pn-3224806949 message: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.465Z" level=info msg="node example-workflow-hq9pn-3224806949 finished: 2023-09-07 06:08:23.465387731 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.465Z" level=info msg="step group example-workflow-hq9pn-3224806949 was unsuccessful: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.467Z" level=info msg="Outbound nodes of example-workflow-hq9pn-437922017 is [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.467Z" level=info msg="Outbound nodes of example-workflow-hq9pn-663554177 is [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.467Z" level=info msg="node example-workflow-hq9pn-663554177 phase Running -> Failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.467Z" level=info msg="node example-workflow-hq9pn-663554177 message: child 'example-workflow-hq9pn-437922017' failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.467Z" level=info msg="node example-workflow-hq9pn-663554177 finished: 2023-09-07 06:08:23.467948371 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.467Z" level=info msg="Checking daemoned children of example-workflow-hq9pn-663554177" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="Outbound nodes of example-workflow-hq9pn set to [example-workflow-hq9pn-437922017]" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="node example-workflow-hq9pn phase Running -> Failed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="node example-workflow-hq9pn finished: 2023-09-07 06:08:23.47054533 +0000 UTC" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="Checking daemoned children of example-workflow-hq9pn" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="TaskSet Reconciliation" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg=reconcileAgentPod namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="Running OnExit handler: exit-handler" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="Steps node example-workflow-hq9pn-3648729681 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.470Z" level=info msg="StepGroup node example-workflow-hq9pn-1754934389 initialized Running" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.473Z" level=info msg="Pod node example-workflow-hq9pn-180906186 initialized Pending" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.473Z" level=info msg="Workflow step group node example-workflow-hq9pn-1754934389 not yet completed" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.533Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"example-workflow-hq9pn\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.533Z" level=info msg="Re-applying updates on latest version and retrying update" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.574Z" level=info msg="Failed to re-apply update" error="must never update completed workflows" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:23.577Z" level=info msg="archiving workflow" namespace=workflows uid=c7812cc5-421a-47d4-a9fa-73a609aaba1f workflow=example-workflow-hq9pn
time="2023-09-07T06:08:33.578Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:33.583Z" level=info msg="Checking daemoned children of " namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:33.619Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"example-workflow-hq9pn\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:33.619Z" level=info msg="Re-applying updates on latest version and retrying update" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:33.634Z" level=info msg="Failed to re-apply update" error="must never update completed workflows" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:43.765Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 59m17s due to TTL"
time="2023-09-07T06:08:53.765Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:53.768Z" level=info msg="Checking daemoned children of " namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:08:53.925Z" level=info msg="Workflow update successful" namespace=workflows phase=Succeeded resourceVersion=1330502461 workflow=example-workflow-hq9pn
time="2023-09-07T06:08:59.012Z" level=info msg="cleaning up pod" action=deletePod key=workflows/example-workflow-hq9pn-notify-failed-180906186/deletePod
time="2023-09-07T06:09:09.105Z" level=info msg="Processing workflow" namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:09:09.109Z" level=info msg="Checking daemoned children of " namespace=workflows workflow=example-workflow-hq9pn
time="2023-09-07T06:09:10.918Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 58m50s due to TTL"
time="2023-09-07T06:09:47.020Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 58m13s due to TTL"
time="2023-09-07T06:09:55.646Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 58m5s due to TTL"
time="2023-09-07T06:10:14.460Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 57m46s due to TTL"
time="2023-09-07T06:10:31.717Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 57m29s due to TTL"
time="2023-09-07T06:11:01.560Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 56m59s due to TTL"
time="2023-09-07T06:11:12.029Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 56m48s due to TTL"
time="2023-09-07T06:11:13.642Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 56m47s due to TTL"
time="2023-09-07T06:11:36.341Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 56m24s due to TTL"
time="2023-09-07T06:11:41.091Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 56m19s due to TTL"
time="2023-09-07T06:12:10.538Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 55m50s due to TTL"
time="2023-09-07T06:12:32.579Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 55m28s due to TTL"
time="2023-09-07T06:12:58.111Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 55m2s due to TTL"
time="2023-09-07T06:13:06.745Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 54m54s due to TTL"
time="2023-09-07T06:13:34.199Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 54m26s due to TTL"
time="2023-09-07T06:13:52.419Z" level=info msg="Queueing Succeeded workflow workflows/example-workflow-hq9pn for delete in 54m8s due to TTL"

I’m seeing a very similar issue in v3.4.10 as above: Here’s an example debug level log trace from the workflow controller:

time=“2023-09-18T20:35:59.885Z” level=debug msg=“Executing node goc-ld-push-4kp96.checkout-sha of Pod is Running” namespace=phoenix workflow=goc-ld-push-4kp96 time=“2023-09-18T20:35:59.885Z” level=info msg=“TaskSet Reconciliation” namespace=phoenix workflow=goc-ld-push-4kp96 time=“2023-09-18T20:35:59.885Z” level=info msg=reconcileAgentPod namespace=phoenix workflow=goc-ld-push-4kp96 time=“2023-09-18T20:36:01.819Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:01.828Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:05.575Z” level=debug msg=“Syncing all CronWorkflows” time=“2023-09-18T20:36:06.844Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:06.855Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:10.565Z” level=debug msg=“Watch cronworkflows 200” time=“2023-09-18T20:36:11.872Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:11.928Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:15.559Z” level=debug msg=“Check the workflow existence” time=“2023-09-18T20:36:15.575Z” level=debug msg=“Syncing all CronWorkflows” time=“2023-09-18T20:36:16.939Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:16.971Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:21.978Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:21.987Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:23.477Z” level=debug msg=“Watch workflowtemplates 200” time=“2023-09-18T20:36:25.576Z” level=debug msg=“Syncing all CronWorkflows” time=“2023-09-18T20:36:27.003Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:27.013Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:32.022Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:32.030Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:35.570Z” level=debug msg=“Watch workflows 200” time=“2023-09-18T20:36:35.576Z” level=debug msg=“Syncing all CronWorkflows” time=“2023-09-18T20:36:37.039Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:37.048Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:38.193Z” level=debug msg=“List workflows 200” time=“2023-09-18T20:36:42.060Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:42.066Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:45.577Z” level=debug msg=“Syncing all CronWorkflows” time=“2023-09-18T20:36:47.080Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:47.096Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:52.102Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:52.112Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:36:55.577Z” level=debug msg=“Syncing all CronWorkflows” time=“2023-09-18T20:36:57.140Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:36:57.150Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:37:02.157Z” level=debug msg=“Get leases 200” time=“2023-09-18T20:37:02.169Z” level=debug msg=“Update leases 200” time=“2023-09-18T20:37:04.760Z” level=info msg=“Processing workflow” namespace=phoenix workflow=goc-ld-push-4kp96 time=“2023-09-18T20:37:04.760Z” level=debug msg="unresolved is allowed " error=unresolved time=“2023-09-18T20:37:04.760Z” level=debug msg="unresolved is allowed " error=unresolved time=“2023-09-18T20:37:04.760Z” level=debug msg="unresolved is allowed " error=unresolved time=“2023-09-18T20:37:04.760Z” level=info msg=“Task-result reconciliation” namespace=phoenix numObjs=0 workflow=goc-ld-push-4kp96 time=“2023-09-18T20:37:04.760Z” level=info msg=“Workflow pod is missing” namespace=phoenix nodeName=goc-ld-push-4kp96.checkout-sha nodePhase=Running recentlyStarted=false workflow=goc-ld-push-4kp96 time=“2023-09-18T20:37:04.760Z” level=info msg=“node goc-ld-push-4kp96-857076312 phase Running -> Error” namespace=phoenix workflow=goc-ld-push-4kp96 time=“2023-09-18T20:37:04.760Z” level=info msg=“node goc-ld-push-4kp96-857076312 message: pod deleted” namespace=phoenix workflow=goc-ld-push-4kp96 time=“2023-09-18T20:37:04.761Z” level=info msg=“node goc-ld-push-4kp96-857076312 finished: 2023-09-18 20:37:04.761002282 +0000 UTC” namespace=phoenix workflow=goc-ld-push-4kp96

From what I can see in the pod log, the pod actually succeeded without any error, but the workflow controller thought the pod was missing and thus updated the workflow status to failure.

Thanks @jmaicher

We ended up migrating to v3.4.4 and it’s been working without issues now (6 days so far), so seems that this bug has been introduced in a version between [3.4.5, 3.4.9]

I checked the k8s apiserver audit logs for one of the pods (used in a workflow) that was not deleted by workflow-controller GC (but should have): The last row is manual deletion that was done the next day. And the argo workflow-controller should have deleted it way earlier.

Screenshot from 2023-08-16 18-03-23