argo-workflows: Workflow deadlocks on mutex in steps template if controller is restarted

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

What happened/what you expected to happen?

The workflow deadlocks on its mutex and never completes if the workflow controller is restarted while a steps template holds the mutex. I expect my workflow to complete successfully even if the workflow controller is restarted. The same issue does not occur if the mutex is on the container template or at the workflow level.
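
For clarity, moving the mutex to the workflow level (one of the placements that does not deadlock for me) looks roughly like this; just a sketch of the same repro workflow with the synchronization block relocated:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  # Mutex held for the whole workflow instead of on the steps template
  synchronization:
    mutex:
      name: "mutex-{{workflow.name}}"

  templates:
    - name: main-steps
      steps:
        - - name: main
            template: main-container

    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]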

What version are you running? v3.3.4 is what I’ve been using for this reproduction, but I’ve also been able to reproduce it in v3.3.3, v3.3.2, and v3.2.7 with a more complicated workflow.

Diagnostics

This is a workflow that reliably reproduces the issue. The steps I use are:

argo submit deadlock.yaml
sleep 5
kubectl delete pod -l app.kubernetes.io/component=workflow-controller

# deadlock.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps

  templates:
    - name: main-steps
      synchronization:
        mutex:
          name: "mutex-{{workflow.name}}"
      steps:
        - - name: main
            template: main-container

    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]

Pod logs

No pod logs from the main container since it just sleeps and exits.

Wait container logs

time="2022-05-09T12:57:42.216Z" level=info msg="Starting Workflow Executor" executorType=docker version=v3.3.2
time="2022-05-09T12:57:42.217Z" level=info msg="Creating a docker executor"
time="2022-05-09T12:57:42.217Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-05-09T12:57:42.217Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kube-system podName=deadlock-test-bnbjc-938207750 template="{\"name\":\"main-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"busybox:latest\",\"command\":[\"sh\"],\"args\":[\"-c\",\"sleep 10; exit 0\"],\"resources\":{}}}" version="&Version{Version:v3.3.2,BuildDate:2022-04-20T21:41:09Z,GitCommit:35492a1700a0f279694cac874b6d9c07a08265d1,GitTag:v3.3.2,GitTreeState:clean,GoVersion:go1.17.9,Compiler:gc,Platform:linux/arm64,}"
time="2022-05-09T12:57:42.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:42.218Z" level=info msg="Starting deadline monitor"
time="2022-05-09T12:57:42.231Z" level=info msg="listed containers" containers="map[wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:42.231Z" level=info msg="mapped container name \"wait\" to container ID \"a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522\" (created at 2022-05-09 12:57:42 +0000 UTC, status Up)"
time="2022-05-09T12:57:43.232Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:43.248Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:43.248Z" level=info msg="mapped container name \"main\" to container ID \"0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862\" (created at 2022-05-09 12:57:43 +0000 UTC, status Up)"
time="2022-05-09T12:57:44.220Z" level=info msg="docker wait 0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862"
time="2022-05-09T12:57:44.248Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:44.269Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:45.270Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:45.287Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:46.288Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:46.323Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:47.325Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:47.350Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:48.351Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:48.380Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:49.380Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:49.409Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:50.411Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:50.449Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:51.450Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:51.490Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:52.491Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:52.536Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.289Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.310Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.310Z" level=info msg="Main container completed"
time="2022-05-09T12:57:53.310Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-05-09T12:57:53.310Z" level=info msg="No output parameters"
time="2022-05-09T12:57:53.310Z" level=info msg="No output artifacts"
time="2022-05-09T12:57:53.310Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.327Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.327Z" level=info msg="Killing sidecars []"
time="2022-05-09T12:57:53.327Z" level=info msg="Alloc=6106 TotalAlloc=10797 Sys=19410 NumGC=3 Goroutines=8"

Workflow controller logs

Before killing (old workflow controller pod): logs-before.txt

After killing (new workflow controller pod): logs-after.txt

Workflow resource

Looking at the workflow resource after a while, it shows a deadlock in the synchronization section (sorry this run doesn’t match the logs above, but the structure is the same every time): the workflow is both holding and waiting on the same lock.

  synchronization:
    mutex:
      holding:
      - holder: deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66
      waiting:
      - holder: kube-system/deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66

workflow-resource.txt (uploaded as .txt because GitHub doesn’t allow .yaml attachments)
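
For anyone poking at this, the stuck synchronization block above can be read straight off the live object; a sketch, substituting your own workflow name (the namespace matches the logs above):

kubectl -n kube-system get workflow deadlock-test-4lp66 -o jsonpath='{.status.synchronization.mutex}'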


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 17
  • Comments: 27 (11 by maintainers)

Most upvoted comments

I think we are also hitting this issue - we have just upgraded to the latest Argo (3.4.9), if that’s helpful. Right now our synchronization semaphore is configured like this:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: template-name
spec:
  templates:
    - name: template-name
      synchronization:
        semaphore:
          configMapKeyRef:
            name: template-synchronization-config
            key: template-key
      container: # rest of the template config here.
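
For completeness, that semaphore points at a ConfigMap entry whose value is the concurrency limit; ours looks roughly like this (a sketch, assuming a limit of 1):

apiVersion: v1
kind: ConfigMap
metadata:
  name: template-synchronization-config
data:
  template-key: "1"  # maximum number of concurrent holders of this semaphore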

I’m trying to decipher the comments in this issue to see whether there’s a different way to set up this synchronization that won’t be affected while the refactor is in progress.

Is there an easy way to clear out the semaphore locks on Argo’s side? Restarting the workflow-controller doesn’t seem to be a viable way to recover.

We also hit this on v3.5.2; the solution was to restart the controller. It does mean we have a major issue with workflows using the mutex feature. Would a semaphore with a limit of 1 mitigate this?
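
For reference, this is roughly what that would look like on the steps template from the repro above; just a sketch, with a placeholder ConfigMap name and key (whose value would be "1"), and whether it actually avoids the deadlock is exactly my question:

    - name: main-steps
      synchronization:
        semaphore:
          configMapKeyRef:
            name: my-semaphore-config  # placeholder ConfigMap holding the limit
            key: workflow              # placeholder key whose value is "1"
      steps:
        - - name: main
            template: main-container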

@caelan-io I can confirm this issue is still a problem unfortunately.

I believe my original comment is still correct: something else is going on. What I described in my previous comment is now fixed in :latest, so something else is causing it now. Looking into it again; I will update when done.