argo-workflows: Random failures with message "Error: No such container:path: ... This does not look like a tar archive tar"
Downstream KFP issue: https://github.com/kubeflow/pipelines/issues/5943
Summary
What happened/what you expected to happen?
When I run many workflows in a cluster, some workflows fail at random steps (roughly 5~10%) with an error like the following:
This step is in Error state with this message: Error (exit code 1): Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar: Exiting with failure status due to previous errors
The error can happen at any step randomly.
I’d expect Argo to identify the root cause of this and return a meaningful error message.
Diagnostics
👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we’ll go around in circles asking you for it:
What Kubernetes provider are you using? GKE (I’m using Kubernetes 1.18 with Container-optimized OS with docker nodes)
What version of Argo Workflows are you running? v3.1.1
What executor are you running? Docker
Did this work in a previous version? I.e. is it a regression? I did not see this in v2.12.x or v3.0.x.
Investigations
After reading through the detailed logs, here’s my understanding of the problem:
- Argo creates a Pod A.
- There are too many Pods on the same node, so kubelet decides to kill Pod A to leave enough resources for other Pods (see “Killing unwanted pod” in the logs below).
- Therefore, all the containers receive a TERM signal at the same time. (this is probably wrong)
- The wait container tries to get output artifacts/parameters before the main container finishes; see the detailed logs below (the wait container finished at 2021-07-16T08:55:36Z, earlier than the main container, which finished at 2021-07-16T08:55:38Z).
- wait container failed to get outputs from main container:
sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz
- got final error message
Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar
ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
Suggestions
I don’t understand the details of the wait container.
For the final error message, I think we can improve things by fixing this command:
sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz
Note that this script’s exit code is always that of the second command, gzip > /tmp/....., regardless of whether the first command failed. This is how a shell pipe behaves by default.
One workaround is to pipe the command outputs in Go (https://stackoverflow.com/a/10781582/8745218) and check the exit codes of both commands properly, so that we return the error message of the docker cp command when it fails. I think the file /tmp/outputs/Output/data didn’t exist at the time this command was run.
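To illustrate the pipe behaviour described above, here is a minimal sketch in plain POSIX sh. It is not the executor's code: failing_producer is a hypothetical stand-in for a docker cp that fails (it prints a simulated error, writes nothing to stdout, and exits 1), and the temporary paths are made up for the example. It shows that the pipeline's status only reflects gzip, and one portable way to recover the first command's status through a side file.

```shell
#!/bin/sh
# Hypothetical stand-in for a failing `docker cp`: no stdout, exit status 1.
failing_producer() {
  echo "Error: No such container:path (simulated)" >&2
  return 1
}

workdir="$(mktemp -d)"
out="$workdir/out.tgz"
status_file="$workdir/producer.status"

# 1. Plain pipeline: $? is the status of the last command (gzip) only,
#    so the producer's failure is silently lost.
failing_producer 2>/dev/null | gzip > "$out"
pipeline_status=$?

# 2. Portable workaround: record the producer's own status in a side file
#    from inside the pipeline, then read it back once the pipeline is done.
{
  rc=0
  failing_producer 2>/dev/null || rc=$?
  echo "$rc" > "$status_file"
} | gzip > "$out"
producer_status="$(cat "$status_file")"

echo "pipeline_status=$pipeline_status producer_status=$producer_status"
# prints: pipeline_status=0 producer_status=1

rm -rf "$workdir"
```

In bash specifically, set -o pipefail or the PIPESTATUS array give the same information without a side file, but neither is guaranteed to be available when the command is run through sh -c as in the log above.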
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 14
- Comments: 22 (22 by maintainers)
Commits related to this issue
- fix(executor/docker): fix random errors with message "No such container:path". Fixes #6352 Signed-off-by: Yuan Gong <gongyuan94@gmail.com> — committed to Bobgy/argo-workflows by Bobgy 3 years ago
- fix(executor/docker): fix random errors with message "No such container:path". Fixes #6352 Signed-off-by: Yuan Gong <gongyuan94@gmail.com> — committed to Bobgy/argo-workflows by Bobgy 3 years ago
- fix(executor/docker): fix random errors with message "No such container:path". Fixes #6352 (#6483) Signed-off-by: Yuan Gong <gongyuan94@gmail.com> — committed to argoproj/argo-workflows by Bobgy 3 years ago
- Revert "fix(executor/docker): fix random errors with message "No such container:path". Fixes #6352 (#6483)" This reverts commit e4a53d4bf021fd4dce1374bb7fd4320d733e57ba. Signed-off-by: Alex Collins ... — committed to argoproj/argo-workflows by alexec 3 years ago
- Revert "Revert "fix(executor/docker): fix random errors with message "No such container:path". Fixes #6352 (#6483)"" This reverts commit a3fd704a1715900f2144c0362e562f75f1524126. Signed-off-by: Yuan... — committed to Bobgy/argo-workflows by Bobgy 3 years ago
- fix(executor/docker): re-revert -- fix random errors with message "No such container:path". Fixes #6352 (#6508) * Revert "Revert "fix(executor/docker): fix random errors with message "No such contain... — committed to argoproj/argo-workflows by Bobgy 3 years ago
- fix(executor/docker): re-revert -- fix random errors with message "No such container:path". Fixes #6352 (#6508) * Revert "Revert "fix(executor/docker): fix random errors with message "No such contain... — committed to argoproj/argo-workflows by Bobgy 3 years ago
Run
make test-executor E2E_EXECUTOR=docker