argo-workflows: Random failures with message "Error: No such container:path: ... This does not look like a tar archive tar"

Downstream KFP issue: https://github.com/kubeflow/pipelines/issues/5943

Summary

What happened/what you expected to happen?

When I run many workflows in a cluster, some workflows (roughly 5–10%) fail at random steps with errors like the following message:

This step is in Error state with this message: Error (exit code 1): Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar: Exiting with failure status due to previous errors

The error can happen at any step randomly.

I’d expect Argo to identify the root cause of this and return a meaningful error message.

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we’ll go around in circles asking you for it:

What Kubernetes provider are you using? GKE (Kubernetes 1.18, Container-Optimized OS with Docker nodes)

What version of Argo Workflows are you running? v3.1.1

What executor are you running? Docker

Did this work in a previous version? I.e. is it a regression? I did not see this in v2.12.x or v3.0.x.

Investigations

After reading through the detailed logs, here’s my understanding of the problem:

  1. Argo creates Pod A.
  2. There are too many Pods on the same node, so the kubelet decides to kill Pod A to leave enough resources for other Pods (see the “Killing unwanted pod” logs below).
  3. Therefore, all the containers receive a TERM signal at the same time. (This is probably wrong.)
  4. The wait container tries to collect output artifacts/parameters before the main container finishes – see the detailed logs below (the wait container finished at 2021-07-16T08:55:36Z, earlier than the main container, which finished at 2021-07-16T08:55:38Z).
  5. The wait container fails to get outputs from the main container while running: sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz
  6. The step ends with the final error message: Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar

ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

Suggestions

I don’t understand the wait container internals in detail, but for the final error message, I think we can improve things by fixing the sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz command.

Note that this script’s exit code is always that of the second command (gzip > /tmp/...), regardless of whether the first command failed. This is how shell pipelines behave: the pipeline’s exit status is the exit status of its last command.
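A minimal illustration of this behavior (a sketch only; Go is used here because the suggested fix below is in Go):

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        // "false" fails, but the pipeline's exit status comes from the last
        // command ("true"), so Run reports success and the failure is hidden.
        err := exec.Command("sh", "-c", "false | true").Run()
        fmt.Println("pipeline error:", err) // prints: pipeline error: <nil>
    }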

One workaround is to pipe the command outputs in Go (https://stackoverflow.com/a/10781582/8745218) and check the exit codes of both commands properly, so that we return the error message of the docker cp command when it fails. I think the file /tmp/outputs/Output/data doesn’t exist at the time this command is run.
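A rough sketch of that idea (copyArtifact is a hypothetical name, not Argo’s actual executor code; it assumes the Docker CLI is available where the wait container runs):

    package main

    import (
        "compress/gzip"
        "fmt"
        "io"
        "os"
        "os/exec"
    )

    // copyArtifact runs `docker cp -a <container>:<path> -`, gzips its stdout
    // in-process, and returns docker cp's own error instead of gzip's.
    func copyArtifact(containerID, srcPath, destPath string) error {
        cmd := exec.Command("docker", "cp", "-a", containerID+":"+srcPath, "-")
        cmd.Stderr = os.Stderr // forward docker cp's error text, e.g. "No such container:path"

        stdout, err := cmd.StdoutPipe()
        if err != nil {
            return err
        }
        if err := cmd.Start(); err != nil {
            return err
        }

        out, err := os.Create(destPath)
        if err != nil {
            return err
        }
        defer out.Close()

        // Compress the tar stream from docker cp as we read it.
        gz := gzip.NewWriter(out)
        if _, err := io.Copy(gz, stdout); err != nil {
            return err
        }
        if err := gz.Close(); err != nil {
            return err
        }

        // Wait reports docker cp's real exit status, so a missing file or
        // container fails this step instead of being swallowed by the pipe.
        return cmd.Wait()
    }

    func main() {
        if len(os.Args) != 4 {
            fmt.Fprintln(os.Stderr, "usage: copyartifact <container> <src-path> <dest-tgz>")
            os.Exit(2)
        }
        if err := copyArtifact(os.Args[1], os.Args[2], os.Args[3]); err != nil {
            fmt.Fprintln(os.Stderr, "copy failed:", err)
            os.Exit(1)
        }
    }

With something like this, a missing /tmp/outputs/Output/data would surface docker cp’s own error rather than tar’s misleading “This does not look like a tar archive”.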


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 14
  • Comments: 22 (22 by maintainers)


Most upvoted comments

Run

make test-executor E2E_EXECUTOR=docker