argo-workflows: DAG/STEPS Hang v3.0.2 - Sidecars not being killed
Summary
What happened?
DAG tasks randomly hang
What did you expect to happen?
DAG tasks finish successfully
Diagnostics
What Kubernetes provider are you using?
GKE
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.9-gke.1900", GitCommit:"008fd38bf3dc201bebdd4fe26edf9bf87478309a", GitTreeState:"clean", BuildDate:"2021-04-14T09:22:08Z", GoVersion:"go1.15.8b5", Compiler:"gc", Platform:"linux/amd64"}
What version of Argo Workflows are you running?
v3.0.2
kubectl get wf -o yaml ${workflow}
The workflow contains sensitive information about our organization. If it's important, reach out to me on the CNCF Slack.
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name) | grep ${workflow}
Wait container logs
{},\"mirrorVolumeMounts\":true}],\"sidecars\":[{\"name\":\"mysql\",\"image\":\"mysql:5.6\",\"env\":[{\"name\":\"MYSQL_ALLOW_EMPTY_PASSWORD\",\"value\":\"true\"}],\"reso
urces\":{},\"mirrorVolumeMounts\":true},{\"name\":\"redis\",\"image\":\"redis:alpine3.13\",\"resources\":{},\"mirrorVolumeMounts\":true},{\"name\":\"nginx\",\"image\":\
"nginx:1.19.7-alpine\",\"resources\":{},\"mirrorVolumeMounts\":true}],\"archiveLocation\":{\"archiveLogs\":true,\"gcs\":{\"bucket\":\"7shitfs-argo-workflow-artifacts\",
\"serviceAccountKeySecret\":{\"name\":\"devops-argo-workflow-sa\",\"key\":\"credentials.json\"},\"key\":\"argo-workflow-logs/2021/04/29/github-20979-9df1440/github-2097
9-9df1440-2290904989\"}},\"retryStrategy\":{\"limit\":\"1\",\"retryPolicy\":\"Always\"},\"tolerations\":[{\"key\":\"node_type\",\"operator\":\"Equal\",\"value\":\"large
\",\"effect\":\"NoSchedule\"}],\"hostAliases\":[{\"ip\":\"127.0.0.1\",\"hostnames\":[\"xyz.dev\",\"xyz.test\",\"cypress.xyz.test\",\"codeception.xyz.dev
\"]}],\"podSpecPatch\":\"containers:\\n- name: main\\n resources:\\n request:\\n memory: \\\"8Gi\\\"\\n cpu: \\\"2\\\"\\n limits:\\n memory: \\\"8
Gi\\\"\\n cpu: \\\"2\\\"\\n- name: mysql\\n resources:\\n request:\\n memory: \\\"2Gi\\\"\\n cpu: \\\"0.5\\\"\\n limits:\\n memory: \\\"2Gi\\
\"\\n cpu: \\\"0.5\\\"\\n- name: redis\\n resources:\\n request:\\n memory: \\\"50Mi\\\"\\n cpu: \\\"0.05\\\"\\n limits:\\n memory: \\\"50Mi\
\\"\\n cpu: \\\"0.05\\\"\\n- name: nginx\\n resources:\\n request:\\n memory: \\\"50Mi\\\"\\n cpu: \\\"0.05\\\"\\n limits:\\n memory: \\\"50M
i\\\"\\n cpu: \\\"0.05\\\"\\n\",\"timeout\":\"1200s\"}"
time="2021-04-29T22:33:05.291Z" level=info msg="Starting annotations monitor"
time="2021-04-29T22:33:05.291Z" level=info msg="Starting deadline monitor"
time="2021-04-29T22:33:10.299Z" level=info msg="Watch pods 200"
time="2021-04-29T22:38:05.291Z" level=info msg="Alloc=4475 TotalAlloc=47692 Sys=75089 NumGC=15 Goroutines=10"
time="2021-04-29T22:42:44.410Z" level=info msg="Main container completed"
time="2021-04-29T22:42:44.410Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-04-29T22:42:44.410Z" level=info msg="Capturing script exit code"
time="2021-04-29T22:42:44.410Z" level=info msg="Getting exit code of main"
time="2021-04-29T22:42:44.413Z" level=info msg="Get pods 200"
time="2021-04-29T22:42:44.414Z" level=info msg="Saving logs"
time="2021-04-29T22:42:44.415Z" level=info msg="Getting output of main"
time="2021-04-29T22:42:44.424Z" level=info msg="List log 200"
time="2021-04-29T22:42:44.427Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: argo-workflow-logs/2021/04/29/github-20979-9df1440/github-20979-9df1440-2290904989/main.log"
time="2021-04-29T22:42:44.763Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2021-04-29T22:42:44.763Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2021-04-29T22:42:44.763Z" level=info msg="No output parameters"
time="2021-04-29T22:42:44.763Z" level=info msg="No output artifacts"
time="2021-04-29T22:42:44.763Z" level=info msg="Annotating pod with output"
time="2021-04-29T22:42:44.778Z" level=info msg="Patch pods 200"
time="2021-04-29T22:42:44.779Z" level=info msg="Killing sidecars []"
time="2021-04-29T22:42:44.779Z" level=info msg="Alloc=28577 TotalAlloc=72566 Sys=75089 NumGC=18 Goroutines=11"
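The notable line in the log above is "Killing sidecars []": the list is empty even though the template defines three sidecars (mysql, redis, nginx). A minimal sketch for spotting this in a saved wait container log is below. The function names `killed_sidecars` and `unkilled` are hypothetical, and the format of a non-empty list (space-separated names, possibly quoted) is an assumption, since only the empty case appears in this issue.

```python
import re

# Matches logrus-style lines such as:
#   time="..." level=info msg="Killing sidecars []"
KILL_RE = re.compile(r'msg="Killing sidecars \[(?P<names>[^\]]*)\]"')


def killed_sidecars(log_text):
    """Return the names from the last 'Killing sidecars' line, or None if absent.

    Assumption: a non-empty list is space-separated and may carry
    (escaped) quotes around each name; those are stripped.
    """
    result = None
    for line in log_text.splitlines():
        match = KILL_RE.search(line)
        if match:
            result = [n.strip('\\"') for n in match.group("names").split()]
    return result


def unkilled(expected, log_text):
    """Return the expected sidecars that the log never reports as killed."""
    killed = killed_sidecars(log_text) or []
    return sorted(set(expected) - set(killed))


if __name__ == "__main__":
    sample = 'time="2021-04-29T22:42:44.779Z" level=info msg="Killing sidecars []"'
    print(unkilled(["mysql", "redis", "nginx"], sample))
```

Run against the log in this report, every declared sidecar is reported as not killed, which matches the symptom of the pod never completing.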
I’ve repeatedly tried to upgrade our Argo Workflows version, but since 3.x.x DAG tasks have stopped working properly. I’m currently using v2.12 with no problems at all.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 5
- Comments: 39 (39 by maintainers)
Commits related to this issue
- fix(executor): Allow 1m for PNS executor to secure file handles. Fixes #5779 Signed-off-by: Alex Collins <alex_collins@intuit.com> — committed to argoproj/argo-workflows by alexec 3 years ago
- fix(executor): Enable PNS executor to better kill sidecars. Fixes #5779 (#5794) — committed to argoproj/argo-workflows by alexec 3 years ago
I meant of argoexec
@alexec It’s working really well. I don’t want to downgrade again, so I really hope we can keep using it until the fix is released. Keep me posted about it.
Thank you
@caueasantos I’ve tidied up the code and pushed the changes. It would be great if you could check them, just in case I’ve somehow reverted the fix (that does occasionally happen). Thank you again for taking so much time to help test this.
Ok. That’s a bug, which I think I’ve just fixed. Can you try again please?
argoproj/argoexec:dev-5779
I’ve created a new image. Can you please test it?
Could you please provide the logs from v2.12?
Thanks for giving it a go. I’ll look to repro tomorrow and see what I discover
@alexec failed again
Executor
PNS
Image
argoproj/argoexec:dev-5779
get-pod-json wait-sidecars-json
Sure. Let me try a few more times with the new image.