argo-workflows: Cannot terminate nested parallelism workflow with synchronization semaphore
Summary
What happened/what you expected to happen?
When I terminate a workflow that uses nested parallelism together with a ConfigMap-based synchronization semaphore to limit the number of parallel jobs, some of the jobs get stuck and are never terminated: they keep waiting for the ConfigMap lock (the semaphore lock does not appear to be released, see screenshots). I used examples/parallelism-nested.yaml as a starting point and added a synchronization semaphore to the worker template (see the source of the “bugged” workflow below).
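For orientation, a rough sketch of the shape of that workflow follows. The template names, item lists, and the ConfigMap reference here are illustrative assumptions; the exact reproduction is in the attached test2_BUGGED.txt.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-nested-
spec:
  entrypoint: parallel-worker
  templates:
  # Outer level: fan out over groups, limited by template-level parallelism.
  - name: parallel-worker
    parallelism: 2
    steps:
    - - name: group
        template: worker-group
        withItems: ["a", "b", "c", "d"]
  # Inner level: fan out again within each group (nested parallelism).
  - name: worker-group
    parallelism: 2
    steps:
    - - name: worker
        template: worker
        withItems: [1, 2, 3, 4]
  # Leaf worker: this is where the synchronization semaphore was added.
  - name: worker
    synchronization:
      semaphore:
        configMapKeyRef:
          name: semaphore-config      # assumed ConfigMap name
          key: count                  # assumed key; value "2" per the report
    container:
      image: alpine:3.7
      command: [sleep, "30"]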
What version of Argo Workflows are you running?
3.2.6
Diagnostics
Either a workflow that reproduces the bug, or paste your whole workflow YAML, including status, something like:
kubectl get wf -o yaml ${workflow}
Working workflow yaml: test3_OK.txt
Bugged workflow yaml: test2_BUGGED.txt
Full workflow yamls from kubectl:
kubectl_get_wf_gtfdh_OK.txt kubectl_get_wf_rk4r2_BUGGED.txt
What Kubernetes provider are you using?
AWS EKS
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Default settings for Argo v3.2: Docker
Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
kubectl_logs_workflow_controller_gtfdh_OK.txt kubectl_logs_workflow_controller_rk4r2_BUGGED.txt
The workflow’s pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
kubectl_get_pod_not_succeed_gtfdh_OK.txt kubectl_get_pod_not_succeed_rk4r2_BUGGED.txt
Logs from in your workflow’s wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
kubectl_logs_wait_not_succeed_gtfdh_OK.txt kubectl_logs_wait_not_succeed_rk4r2_BUGGED.txt
ConfigMap “count” is set to “2”:
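For reference, a ConfigMap of this shape would back the semaphore. The name and key below are assumptions matching the sketch above; only the value “2” is taken from the report.

apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config    # assumed name
data:
  count: "2"                # at most two workers may hold the semaphore at once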
Pod list from the “bugged” workflow - finished, skipped (after termination), and waiting (these should have been skipped, but remain pending - the bug):
Phase - Succeeded:
parallelism-nested-rk4r2-1802331121
parallelism-nested-rk4r2-2087857811
parallelism-nested-rk4r2-2853201953
parallelism-nested-rk4r2-3699907663
Phase - Skipped:
parallelism-nested-rk4r2-1330979975
parallelism-nested-rk4r2-61384565
parallelism-nested-rk4r2-2957862257
parallelism-nested-rk4r2-3316765249
parallelism-nested-rk4r2-1496436579
parallelism-nested-rk4r2-986178545
parallelism-nested-rk4r2-2034718815
Phase - Pending (the bug):
parallelism-nested-rk4r2-1156447283
parallelism-nested-rk4r2-3339628721
parallelism-nested-rk4r2-1168898307
parallelism-nested-rk4r2-3911978837
parallelism-nested-rk4r2-12399201
parallelism-nested-rk4r2-2657638275
parallelism-nested-rk4r2-2147380241
parallelism-nested-rk4r2-2983743359
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 15 (8 by maintainers)
Has this bug been solved?
Hi everyone, this bug is still an issue in Argo 3.4.10 (running on AKS). When users terminate workflows that hold locks on a semaphore via the Argo UI, these locks are sporadically not released, so the number of held locks on the semaphore gradually builds up. As a workaround, you can manually restart the Argo controller to release the locks. Alternatively, you can delete workflows instead of terminating them, which seems to release the semaphore locks correctly. We have tested the bugged workflow YAML from @lstolcman and can reproduce the error. Could you please reopen the ticket, since the bug still persists?
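The workarounds described above, expressed as commands (the namespace and ${workflow} name are placeholders):

# Restart the workflow controller to force stuck semaphore locks to be released
kubectl -n argo rollout restart deployment workflow-controller

# Delete the workflow instead of terminating it; this appears to release the semaphore correctly
argo delete ${workflow} -n argo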
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been closed due to inactivity. Feel free to re-open if you still encounter this issue.