argo-workflows: Cannot terminate nested parallelism workflow with synchronization semaphore
Summary
What happened/what you expected to happen?
When I terminate a workflow that uses nested parallelism together with a ConfigMap-based synchronization semaphore to limit the number of parallel jobs, some of the jobs get stuck and are never terminated: they keep waiting for the ConfigMap lock (the semaphore lock does not appear to be released, see screenshots). I used examples/parallelism-nested.yaml as a starting point and added a synchronization semaphore to the worker template (see the source of the “bugged” workflow below).
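For orientation, a rough sketch of the shape of that workflow follows. The template names, item lists, and the ConfigMap reference here are illustrative assumptions; the exact reproduction is in the attached test2_BUGGED.txt.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-nested-
spec:
  entrypoint: parallel-worker
  templates:
  # Outer level: fan out over groups, limited by template-level parallelism.
  - name: parallel-worker
    parallelism: 2
    steps:
    - - name: group
        template: worker-group
        withItems: ["a", "b", "c", "d"]
  # Inner level: fan out again within each group (nested parallelism).
  - name: worker-group
    parallelism: 2
    steps:
    - - name: worker
        template: worker
        withItems: [1, 2, 3, 4]
  # Leaf worker: this is where the synchronization semaphore was added.
  - name: worker
    synchronization:
      semaphore:
        configMapKeyRef:
          name: semaphore-config      # assumed ConfigMap name
          key: count                  # assumed key; value "2" per the report
    container:
      image: alpine:3.7
      command: [sleep, "30"]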
What version of Argo Workflows are you running?
3.2.6
Diagnostics
Either a workflow that reproduces the bug, or paste your whole workflow YAML, including status, something like:
kubectl get wf -o yaml ${workflow}
Working workflow yaml: test3_OK.txt
Bugged workflow yaml: test2_BUGGED.txt
Full workflow yamls from kubectl:
kubectl_get_wf_gtfdh_OK.txt kubectl_get_wf_rk4r2_BUGGED.txt
What Kubernetes provider are you using?
AWS EKS
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Default settings for Argo v3.2: Docker
Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
kubectl_logs_workflow_controller_gtfdh_OK.txt kubectl_logs_workflow_controller_rk4r2_BUGGED.txt
The workflow’s pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
kubectl_get_pod_not_succeed_gtfdh_OK.txt kubectl_get_pod_not_succeed_rk4r2_BUGGED.txt
Logs from in your workflow’s wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
kubectl_logs_wait_not_succeed_gtfdh_OK.txt kubectl_logs_wait_not_succeed_rk4r2_BUGGED.txt
ConfigMap “count” is set to “2”:
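For reference, a ConfigMap of this shape would back the semaphore. The name and key below are assumptions matching the sketch above; only the value “2” is taken from the report.

apiVersion: v1
kind: ConfigMap
metadata:
  name: semaphore-config    # assumed name
data:
  count: "2"                # at most two workers may hold the semaphore at once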
Pod list from the “bugged” workflow - finished, skipped (after termination), and waiting (these should have been skipped, but remain pending - the bug):
Phase - Succeeded:
parallelism-nested-rk4r2-1802331121
parallelism-nested-rk4r2-2087857811
parallelism-nested-rk4r2-2853201953
parallelism-nested-rk4r2-3699907663
Phase - Skipped:
parallelism-nested-rk4r2-1330979975
parallelism-nested-rk4r2-61384565
parallelism-nested-rk4r2-2957862257
parallelism-nested-rk4r2-3316765249
parallelism-nested-rk4r2-1496436579
parallelism-nested-rk4r2-986178545
parallelism-nested-rk4r2-2034718815
Phase - Pending (the bug):
parallelism-nested-rk4r2-1156447283
parallelism-nested-rk4r2-3339628721
parallelism-nested-rk4r2-1168898307
parallelism-nested-rk4r2-3911978837
parallelism-nested-rk4r2-12399201
parallelism-nested-rk4r2-2657638275
parallelism-nested-rk4r2-2147380241
parallelism-nested-rk4r2-2983743359
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 15 (8 by maintainers)
Has this bug been solved?
Hi everyone, this bug is still an issue in Argo 3.4.10 (running on AKS). When users terminate workflows that hold locks on a semaphore via the Argo UI, these locks are sporadically not released, so the number of held locks on the semaphore gradually builds up. As a workaround, you can manually restart the Argo controller to release the locks. Alternatively, you can delete workflows instead of terminating them, which seems to release the semaphore locks correctly. We have tested the bugged workflow YAML from @lstolcman and can reproduce the error. Could you please reopen the ticket, since the bug still persists?
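The workarounds described above, expressed as commands (the namespace and ${workflow} name are placeholders):

# Restart the workflow controller to force stuck semaphore locks to be released
kubectl -n argo rollout restart deployment workflow-controller

# Delete the workflow instead of terminating it; this appears to release the semaphore correctly
argo delete ${workflow} -n argo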
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been closed due to inactivity. Feel free to re-open if you still encounter this issue.