argo-workflows: DAG/STEPS Hang v3.0.2 - Sidecars not being killed

Summary

What happened?

DAG tasks randomly hang
(Screenshot, 2021-04-29: DAG tasks stuck in a running state)

What did you expect to happen?

DAG tasks successfully finished

Diagnostics

What Kubernetes provider are you using?

GKE
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.9-gke.1900", GitCommit:"008fd38bf3dc201bebdd4fe26edf9bf87478309a", GitTreeState:"clean", BuildDate:"2021-04-14T09:22:08Z", GoVersion:"go1.15.8b5", Compiler:"gc", Platform:"linux/amd64"}

What version of Argo Workflows are you running?

v3.0.2

kubectl get wf -o yaml ${workflow}

The workflow contains sensitive information about our organization; if it’s important, reach out to me on the CNCF Slack.

kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name) | grep ${workflow}

controller-dag-hang.txt

Wait container logs

W
{},\"mirrorVolumeMounts\":true}],\"sidecars\":[{\"name\":\"mysql\",\"image\":\"mysql:5.6\",\"env\":[{\"name\":\"MYSQL_ALLOW_EMPTY_PASSWORD\",\"value\":\"true\"}],\"reso
urces\":{},\"mirrorVolumeMounts\":true},{\"name\":\"redis\",\"image\":\"redis:alpine3.13\",\"resources\":{},\"mirrorVolumeMounts\":true},{\"name\":\"nginx\",\"image\":\
"nginx:1.19.7-alpine\",\"resources\":{},\"mirrorVolumeMounts\":true}],\"archiveLocation\":{\"archiveLogs\":true,\"gcs\":{\"bucket\":\"7shitfs-argo-workflow-artifacts\",
\"serviceAccountKeySecret\":{\"name\":\"devops-argo-workflow-sa\",\"key\":\"credentials.json\"},\"key\":\"argo-workflow-logs/2021/04/29/github-20979-9df1440/github-2097
9-9df1440-2290904989\"}},\"retryStrategy\":{\"limit\":\"1\",\"retryPolicy\":\"Always\"},\"tolerations\":[{\"key\":\"node_type\",\"operator\":\"Equal\",\"value\":\"large
\",\"effect\":\"NoSchedule\"}],\"hostAliases\":[{\"ip\":\"127.0.0.1\",\"hostnames\":[\"xyz.dev\",\"xyz.test\",\"cypress.xyz.test\",\"codeception.xyz.dev
\"]}],\"podSpecPatch\":\"containers:\\n- name: main\\n  resources:\\n    request:\\n      memory: \\\"8Gi\\\"\\n      cpu: \\\"2\\\"\\n    limits:\\n      memory: \\\"8
Gi\\\"\\n      cpu: \\\"2\\\"\\n- name: mysql\\n  resources:\\n    request:\\n      memory: \\\"2Gi\\\"\\n      cpu: \\\"0.5\\\"\\n    limits:\\n      memory: \\\"2Gi\\
\"\\n      cpu: \\\"0.5\\\"\\n- name: redis\\n  resources:\\n    request:\\n      memory: \\\"50Mi\\\"\\n      cpu: \\\"0.05\\\"\\n    limits:\\n      memory: \\\"50Mi\
\\"\\n      cpu: \\\"0.05\\\"\\n- name: nginx\\n  resources:\\n    request:\\n      memory: \\\"50Mi\\\"\\n      cpu: \\\"0.05\\\"\\n    limits:\\n      memory: \\\"50M
i\\\"\\n      cpu: \\\"0.05\\\"\\n\",\"timeout\":\"1200s\"}"
time="2021-04-29T22:33:05.291Z" level=info msg="Starting annotations monitor"
time="2021-04-29T22:33:05.291Z" level=info msg="Starting deadline monitor"
time="2021-04-29T22:33:10.299Z" level=info msg="Watch pods 200"
time="2021-04-29T22:38:05.291Z" level=info msg="Alloc=4475 TotalAlloc=47692 Sys=75089 NumGC=15 Goroutines=10"
time="2021-04-29T22:42:44.410Z" level=info msg="Main container completed"
time="2021-04-29T22:42:44.410Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-04-29T22:42:44.410Z" level=info msg="Capturing script exit code"
time="2021-04-29T22:42:44.410Z" level=info msg="Getting exit code of main"
time="2021-04-29T22:42:44.413Z" level=info msg="Get pods 200"
time="2021-04-29T22:42:44.414Z" level=info msg="Saving logs"
time="2021-04-29T22:42:44.415Z" level=info msg="Getting output of main"
time="2021-04-29T22:42:44.424Z" level=info msg="List log 200"
time="2021-04-29T22:42:44.427Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: argo-workflow-logs/2021/04/29/github-20979-9df1440/github-20979-9df
1440-2290904989/main.log"
time="2021-04-29T22:42:44.763Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2021-04-29T22:42:44.763Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2021-04-29T22:42:44.763Z" level=info msg="No output parameters"
time="2021-04-29T22:42:44.763Z" level=info msg="No output artifacts"
time="2021-04-29T22:42:44.763Z" level=info msg="Annotating pod with output"
time="2021-04-29T22:42:44.778Z" level=info msg="Patch pods 200"
time="2021-04-29T22:42:44.779Z" level=info msg="Killing sidecars []"
time="2021-04-29T22:42:44.779Z" level=info msg="Alloc=28577 TotalAlloc=72566 Sys=75089 NumGC=18 Goroutines=11"

I’ve been continuously trying to upgrade our Argo Workflows version, but since 3.x.x DAG tasks have stopped working properly. I’m currently using v2.12 with no problems at all.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 39 (39 by maintainers)


Most upvoted comments

I meant the version of argoexec.

@alexec It’s working really well. I don’t want to downgrade again, so I really hope we can keep using it until the fix lands. Keep me posted about it.

Thank you

@caueasantos I’ve performed a code tidy-up and pushed the changes. It would be great if you could check them, just in case I’ve somehow reverted the fix (that does occasionally happen). Thank you again for taking so much time to help test this.

Ok. That’s a bug, which I think I’ve just fixed. Can you try again please?

argoproj/argoexec:dev-5779

I’ve created a new image. Can you please test it?

Could you please provide the logs from v2.12?

Thanks for giving it a go. I’ll look to repro tomorrow and see what I discover.

@alexec failed again

Executor: PNS, Image: argoproj/argoexec:dev-5779

get-pod-json wait-sidecars-json

I hypothesise that your sidecar containers need >5s to start. I’ve created a fix: argoproj/argoexec:dev-5779 - could you please try that out?
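For anyone else following along, a minimal sketch of how a test executor image can be wired in, assuming a standard v3.0 install using the PNS executor and the default workflow-controller-configmap; the executor/image field is my assumption of the right knob here, and if your controller deployment sets --executor-image, I believe that flag takes precedence, so verify against your install:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # PNS executor, matching the report above
  containerRuntimeExecutor: pns
  # Point the wait/init containers at the dev build of argoexec for testing (assumed field)
  executor: |
    image: argoproj/argoexec:dev-5779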

Sure. Let me try a few more times with the new image.