argo-workflows: Workflows with large/long scripts cause "argument list too long" error on init container

Summary

What happened/what you expected to happen?

When executing a workflow that contains a step configured as a script (e.g. Python, bash) whose source code exceeds roughly 100,000 characters, the pod cannot be started, and the following error message is shown for the Argo init container:

standard_init_linux.go:228: exec user process caused: argument list too long

This is due to the ARG_MAX / MAX_ARG_STRLEN kernel limits being exceeded; MAX_ARG_STRLEN, the limit on a single argument or environment string, is hardcoded at 131072 bytes (32 pages) in most Linux kernels. From my understanding/analysis, the actual script size is not a problem (until you approach the 1 MB etcd object size limit, I guess), since the script itself is mounted into the container by Argo. The problem is that an environment variable named ARGO_TEMPLATE, containing the entire serialized template (and therefore the full script source), is set on the init, wait, and actual workload containers. I believe that in Argo versions < 3.2 this was handled differently, using pod annotations and volumes, and therefore was not a problem; the behaviour was changed with this commit: https://github.com/argoproj/argo-workflows/commit/cecc379ce23e708479e4253bbbf14f7907272c9c

Of course one could ask: “Why do you need such a long script?” and argue that it would be better to split it up and run it in consecutive or parallel containers. However, this may not be easily possible for certain applications. In any case, it puts a hard upper limit on the spec/configuration size of a single pod that Argo can launch.
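For illustration, the failing pattern has roughly the following shape. The print-hello-world template name and the scripts-python- prefix are taken from the log output below; the entrypoint name, the image, and the short script body are illustrative placeholders for a source block of well over 100,000 characters:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: scripts-python-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: print-hello-world
            template: print-hello-world
    - name: print-hello-world
      script:
        image: python:3.9          # illustrative image
        command: [python]
        source: |
          # placeholder: in the real reproduction this block exceeds 100,000 characters,
          # so the serialized template injected as ARGO_TEMPLATE crosses the 131072-byte limit
          print("hello world")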

What version of Argo Workflows are you running?

This affects versions >= 3.2 and was tested on Argo v3.2.3. It is not an issue in 3.1, for example.

Diagnostics

An example workflow YAML is attached as a .txt file: large_script.txt

What Kubernetes provider are you using?

Tested on:

  • Kubernetes v1.21.5 running on Docker Desktop 4.2.0 (70708), Engine: 20.10.10
  • Kubernetes v1.20.10 running on RKE, Engine 20.10.8, with the same error

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary

Tested with the following executors:

  • Docker
  • Emissary on the above Docker Desktop Kubernetes, with the same error

Logs from the workflow controller:

time="2022-01-19T15:49:50.320Z" level=info msg="Processing workflow" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.357Z" level=info msg="Updated phase  -> Running" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.358Z" level=info msg="Steps node scripts-python-tls7z initialized Running" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.359Z" level=info msg="StepGroup node scripts-python-tls7z-295132961 initialized Running" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.370Z" level=info msg="Pod node scripts-python-tls7z-3584289076 initialized Pending" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg="Created pod: scripts-python-tls7z[0].print-hello-world (scripts-python-tls7z-3584289076)" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg="Workflow step group node scripts-python-tls7z-295132961 not yet completed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg=reconcileAgentPod namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.469Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1157453 workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.458Z" level=info msg="Processing workflow" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.465Z" level=info msg="Pod failed: Error (exit code 1)" displayName=print-hello-world namespace=default pod=scripts-python-tls7z-3584289076 templateName=print-hello-world workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.465Z" level=info msg="Updating node scripts-python-tls7z-3584289076 status Pending -> Error" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.465Z" level=info msg="Updating node scripts-python-tls7z-3584289076 message: Error (exit code 1)" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.470Z" level=info msg="Step group node scripts-python-tls7z-295132961 deemed failed: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.470Z" level=info msg="node scripts-python-tls7z-295132961 phase Running -> Failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z-295132961 message: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z-295132961 finished: 2022-01-19 15:50:00.4710189 +0000 UTC" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="step group scripts-python-tls7z-295132961 was unsuccessful: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Outbound nodes of scripts-python-tls7z-3584289076 is [scripts-python-tls7z-3584289076]" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Outbound nodes of scripts-python-tls7z is [scripts-python-tls7z-3584289076]" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z phase Running -> Failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z message: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z finished: 2022-01-19 15:50:00.4711464 +0000 UTC" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Checking daemoned children of scripts-python-tls7z" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg=reconcileAgentPod namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Updated message  -> child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Marking workflow completed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Marking workflow as pending archiving" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Checking daemoned children of " namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.515Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=1157478 workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.525Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/scripts-python-tls7z-3584289076/labelPodCompleted
time="2022-01-19T15:50:00.541Z" level=info msg="archiving workflow" namespace=default uid=d3f4afe4-3a8b-4631-b1ce-620dab93ecb4 workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.745Z" level=info msg="archiving workflow" namespace=default uid=d3f4afe4-3a8b-4631-b1ce-620dab93ecb4 workflow=scripts-python-tls7z

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 23
  • Comments: 23 (15 by maintainers)

Most upvoted comments

Facing this issue with v3.3.8 as well; nothing out of the ordinary other than a particularly long input argument.

@alexec do you mean docs on ARG_MAX? I think this is the best option: https://www.in-ulm.de/~mascheck/various/argmax/

@blkperl, yes, I think this is a duplicate of #7527, and it seems to be caused by the switch from Debian images to Alpine.

Is this a duplicate of #7527?

@alexec Can we add something to the upgrade guide for 3.2 if we plan to change the limit to 128 KB instead of 256 KB? We should also consider changing this: https://github.com/argoproj/argo-workflows/blob/4db1c4c8495d0b8e13c718207175273fe98555a2/workflow/executor/executor.go#L761-L766

@chazapis Any thoughts about https://github.com/argoproj/argo-workflows/commit/cecc379ce23e708479e4253bbbf14f7907272c9c causing a backwards compatibility change?

That is indeed a current version, so yes, please open a new issue. Please include a reproduction and the error message you’re getting.

Could we possibly reopen this? This issue still exists. We’re trying to pass an (admittedly large) JSON payload containing initialization values for a map-reduce workflow into a script template, and it is failing because of this.

@alexec I think the suggested patch will not work because ARG_MAX is the total size of all environment variables and arguments, whereas we originally thought it was the maximum length of a single variable.

I think the only option is to finish the move to distroless, which puts us back on Debian.

Could you instead install the script as a file using artifacts and run that?
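A minimal sketch of that suggestion, assuming the script can be served from a location the cluster can reach (the URL, mount path, and image below are hypothetical placeholders, not from the original report): reference the script through an HTTP (or S3/Git) artifact and run it from a container template, so that only a short artifact reference, not the script body, ends up in the serialized template / ARGO_TEMPLATE environment variable. A raw artifact would not help here, because its data is embedded in the template itself; the same idea applies to large parameter payloads, which can be passed as artifact files instead of parameters.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-script-from-artifact-
spec:
  entrypoint: run-script
  templates:
    - name: run-script
      inputs:
        artifacts:
          - name: script
            path: /mnt/script.py                         # where the executor places the downloaded file
            http:
              url: https://example.com/large_script.py   # hypothetical location of the long script
      container:
        image: python:3.9                                # illustrative image
        command: [python, /mnt/script.py]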