argo-workflows: podGC: onPodCompletion not working properly in v3.4.9
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with :latest
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
Workflow-controller garbage collection doesn’t work properly.
After we updated our Argo Workflows from 3.3.8 to 3.4.9, there was an issue with the Argo controller failing to remove completed pods. We are using the podGC: OnPodCompletion strategy for all workflows in our controller-configmap. Most of the time the pod was removed properly, but for maybe 1-5% of the workflows the finished pod wasn’t removed and just stayed in the cluster as Completed.
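For reference, a minimal sketch of how this strategy can be applied to all workflows via workflowDefaults in the workflow-controller ConfigMap; the namespace and the TTL value are illustrative, not copied from our setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Default name used by the Argo Workflows installation manifests.
  name: workflow-controller-configmap
  namespace: argo
data:
  # Defaults merged into every Workflow submitted to this controller.
  workflowDefaults: |
    spec:
      # Delete each pod as soon as it completes.
      podGC:
        strategy: OnPodCompletion
      # Illustrative TTL; the controller logs below show TTL-based workflow
      # deletion being queued, so some ttlStrategy is presumably configured too.
      ttlStrategy:
        secondsAfterCompletion: 86400
```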
When looking at the workflow-controller logs, there seem to be issues with the garbage collection: the same workflow keeps being put into the garbage collection queue multiple times, even though it should only be put there once. The logs are now constantly spamming lines like the ones below, which wasn’t the case previously with 3.3.8.
I don’t understand why this workflow is put into the queue multiple times. Once should be enough, and the logs from the previous version show that used to be the case.
Here’s the number of log lines from the controller in our Datadog dashboard (ignore the color changes). The number of log lines was stable and low on the previous version, but after we updated to 3.4.9 it went up. At the end of the graph we decided to go back to 3.3.8, and the number of log lines returned to the previous level.
Version
v3.4.9
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.
-
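No reproducer was attached; purely for illustration, a minimal public-image workflow using the podGC strategy described above might look like the following (all names are hypothetical, not taken from our environment):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-example-
spec:
  entrypoint: main
  # Same strategy as in our controller defaults: the finished pod should be
  # deleted as soon as it completes.
  podGC:
    strategy: OnPodCompletion
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["echo hello && sleep 5"]
```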
Logs from the workflow controller
$ kubectl logs workflow-controller-7f8bcdc656-9gvg4 | grep "argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt"
time="2023-08-16T09:14:55.884Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h17m37s due to TTL"
time="2023-08-16T09:15:00.118Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h17m32s due to TTL"
time="2023-08-16T09:22:42.304Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h9m50s due to TTL"
time="2023-08-16T09:32:28.796Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 6h0m4s due to TTL"
time="2023-08-16T09:54:55.019Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h37m37s due to TTL"
time="2023-08-16T09:57:37.509Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h34m55s due to TTL"
time="2023-08-16T10:02:42.185Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h29m50s due to TTL"
time="2023-08-16T10:10:17.360Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h22m15s due to TTL"
time="2023-08-16T10:14:55.166Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h17m37s due to TTL"
time="2023-08-16T10:19:45.878Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h12m47s due to TTL"
time="2023-08-16T10:25:22.294Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 5h7m10s due to TTL"
time="2023-08-16T10:33:06.901Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h59m26s due to TTL"
time="2023-08-16T10:34:56.043Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h57m36s due to TTL"
time="2023-08-16T10:41:50.563Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h50m42s due to TTL"
time="2023-08-16T10:46:52.196Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h45m40s due to TTL"
time="2023-08-16T10:54:55.429Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h37m37s due to TTL"
time="2023-08-16T10:55:52.254Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h36m40s due to TTL"
time="2023-08-16T11:02:24.189Z" level=info msg="Queueing Succeeded workflow argo-managed-processor-staging/ca-keystone-feature-parity-testing-q42nt for delete in 4h30m8s due to TTL"
Logs from in your workflow’s wait container
-
About this issue
- Original URL
- State: open
- Created 10 months ago
- Reactions: 9
- Comments: 16 (4 by maintainers)
We see a similar issue after upgrading from 3.4.4 to 3.4.9. In our case, workflows occasionally complete successfully but somehow enter error/failure state afterwards and the exit handler gets triggered again and sends a failure notification.
Here is our workflow (simplified):
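Roughly, the relevant pattern is an onExit handler gated on the workflow status; a sketch with hypothetical names and images (not our actual workflow):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exit-handler-example-
spec:
  entrypoint: main
  # Runs after the workflow finishes, regardless of outcome.
  onExit: exit-handler
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "echo doing work"]
    - name: exit-handler
      steps:
        - - name: notify-failure
            template: send-failure-notification
            # Should only fire when the workflow did not succeed; this is the
            # check that renders as "Succeeded != Succeeded" (false) in the logs.
            when: "{{workflow.status}} != Succeeded"
    - name: send-failure-notification
      container:
        image: alpine:3.18
        # Placeholder for the real notification call.
        command: [sh, -c, "echo sending failure notification"]
```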
From the workflow controller logs below you can see that the workflow completes at 2023-09-07T06:08:00.021Z and is marked as pending for archiving. The exit handler was skipped because Succeeded != Succeeded evaluated to false. At 2023-09-07T06:08:12.946Z, the task-result reconciliation runs, the workflow pod is missing, and the main step fails. The exit handler is then triggered again at 2023-09-07T06:08:12.959Z and fires the failure notification. Since the workflow has already completed, the workflow update fails around 2023-09-07T06:08:23.574Z with “must never update completed workflows”. The exit handler is then triggered again but doesn’t send another failure notification.
I’m seeing a very similar issue in v3.4.10 as above. Here’s an example debug-level log trace from the workflow controller:
From what I can see in the pod log, the pod actually succeeded without any error, but the workflow controller thought the pod was missing and thus updated the workflow status to failure.
Thanks @jmaicher
We ended up migrating to v3.4.4 and it’s been working without issues now (6 days so far), so it seems this bug was introduced somewhere in [3.4.5, 3.4.9].
I checked the k8s apiserver audit logs for one of the pods (used in a workflow) that was not deleted by workflow-controller GC but should have been: the last row is the manual deletion that was done the next day. The Argo workflow-controller should have deleted it way earlier.