argo-workflows: Workflows with large/long scripts cause argument list too long error on init container
Summary
What happened/what you expected to happen?
When executing a workflow which contains a step configured as a script (e.g. Python, bash) with source code exceeding about 100000 characters, the pod cannot be started, with the following error message shown for the Argo init container:
standard_init_linux.go:228: exec user process caused: argument list too long
This happens because the kernel's MAX_ARG_STRLEN limit (related to ARG_MAX), hardcoded at 131072 bytes in most kernels for a single argument or environment string, is exceeded. From my understanding/analysis, the script size itself is not the problem (until you approach the 1 MB etcd limit, I guess), since Argo mounts the script into the container. The problem is that an environment variable named ARGO_TEMPLATE, containing the entire template definition including the full script source, is set on the init, wait, and actual workload containers. I believe that in Argo versions < 3.2 this was handled differently, via pod annotations and volumes, and was therefore not a problem; the behaviour changed with this commit: https://github.com/argoproj/argo-workflows/commit/cecc379ce23e708479e4253bbbf14f7907272c9c
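For reference, here is a minimal, illustrative workflow (image and names are my own, not part of the attached reproduction) whose bash script shows both limits from inside a container: getconf ARG_MAX reports the total budget for argv plus environment passed to execve(2), while any single argument or environment string over MAX_ARG_STRLEN (32 pages, i.e. 131072 bytes on most kernels) already fails with the same "Argument list too long" error that the init container reports:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: arg-limit-demo-
spec:
  entrypoint: show-limits
  templates:
    - name: show-limits
      script:
        image: debian:bullseye-slim   # any glibc image with bash/getconf works
        command: [bash]
        source: |
          # ARG_MAX: total budget for argv + environment passed to execve(2)
          getconf ARG_MAX
          # MAX_ARG_STRLEN (not queryable via getconf) caps any single argument
          # or environment string at 131072 bytes on most kernels; exceeding it
          # reproduces the same "Argument list too long" failure.
          big=$(head -c 140000 /dev/zero | tr '\0' x)
          /bin/true "$big" || echo "single string over the limit -> E2BIG"
```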
Of course one could ask “Why do you need such a long script?” and argue that it should rather be split up and run in consecutive or parallel containers. However, that is not always feasible, and this behaviour does put a hard upper limit on the spec/configuration size of a single pod that can be launched by Argo.
What version of Argo Workflows are you running?
This affects versions >= 3.2 and was tested on Argo v3.2.3. It is not an issue in 3.1.x, for example.
Diagnostics
An example workflow YAML is attached (as a .txt file): large_script.txt
What Kubernetes provider are you using?
Tested on:
- Kubernetes v1.21.5 running on Docker Desktop 4.2.0 (70708), Engine 20.10.10
- Kubernetes v1.20.10 running on RKE, Engine 20.10.8, with the same error
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Tested with the following executors:
- Docker
- Emissary, on the Docker Desktop Kubernetes above, with the same error
Logs from the workflow controller:
time="2022-01-19T15:49:50.320Z" level=info msg="Processing workflow" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.357Z" level=info msg="Updated phase -> Running" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.358Z" level=info msg="Steps node scripts-python-tls7z initialized Running" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.359Z" level=info msg="StepGroup node scripts-python-tls7z-295132961 initialized Running" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.370Z" level=info msg="Pod node scripts-python-tls7z-3584289076 initialized Pending" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg="Created pod: scripts-python-tls7z[0].print-hello-world (scripts-python-tls7z-3584289076)" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg="Workflow step group node scripts-python-tls7z-295132961 not yet completed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.426Z" level=info msg=reconcileAgentPod namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:49:50.469Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1157453 workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.458Z" level=info msg="Processing workflow" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.465Z" level=info msg="Pod failed: Error (exit code 1)" displayName=print-hello-world namespace=default pod=scripts-python-tls7z-3584289076 templateName=print-hello-world workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.465Z" level=info msg="Updating node scripts-python-tls7z-3584289076 status Pending -> Error" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.465Z" level=info msg="Updating node scripts-python-tls7z-3584289076 message: Error (exit code 1)" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.470Z" level=info msg="Step group node scripts-python-tls7z-295132961 deemed failed: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.470Z" level=info msg="node scripts-python-tls7z-295132961 phase Running -> Failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z-295132961 message: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z-295132961 finished: 2022-01-19 15:50:00.4710189 +0000 UTC" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="step group scripts-python-tls7z-295132961 was unsuccessful: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Outbound nodes of scripts-python-tls7z-3584289076 is [scripts-python-tls7z-3584289076]" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Outbound nodes of scripts-python-tls7z is [scripts-python-tls7z-3584289076]" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z phase Running -> Failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z message: child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="node scripts-python-tls7z finished: 2022-01-19 15:50:00.4711464 +0000 UTC" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Checking daemoned children of scripts-python-tls7z" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg=reconcileAgentPod namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Updated phase Running -> Failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Updated message -> child 'scripts-python-tls7z-3584289076' failed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Marking workflow completed" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Marking workflow as pending archiving" namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.471Z" level=info msg="Checking daemoned children of " namespace=default workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.515Z" level=info msg="Workflow update successful" namespace=default phase=Failed resourceVersion=1157478 workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.525Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/scripts-python-tls7z-3584289076/labelPodCompleted
time="2022-01-19T15:50:00.541Z" level=info msg="archiving workflow" namespace=default uid=d3f4afe4-3a8b-4631-b1ce-620dab93ecb4 workflow=scripts-python-tls7z
time="2022-01-19T15:50:00.745Z" level=info msg="archiving workflow" namespace=default uid=d3f4afe4-3a8b-4631-b1ce-620dab93ecb4 workflow=scripts-python-tls7z
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 23
- Comments: 23 (15 by maintainers)
Commits related to this issue
- Change argoexec base image to debian Reverts back to previous argoexec baseimage of debian due to ARG_MAX limit differences in debian versus alpine causing a regression in behavior. reverts: PR #572... — committed to jfarrell/argo-workflows by jfarrell 2 years ago
- tests: Add test coverage for large workflows Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com> Part of: #7586 — committed to blkperl/argo-workflows by blkperl 2 years ago
- fix: Add better error message for large workflows Part of: #7586 Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com> — committed to blkperl/argo-workflows by blkperl 2 years ago
Facing this issue with v3.3.8 as well, nothing out of the ordinary other than a particularly long input argument.
@alexec do you mean docs on ARG_MAX? I think this is the best option https://www.in-ulm.de/~mascheck/various/argmax/
@blkperl, yes, I think this is a duplicate of #7527, and it seems to be caused by the switch from Debian images to Alpine.
Is this a duplicate of #7527?
@alexec Can we add something to the upgrade guide for 3.2 if we plan to change the limit to 128 KB instead of 256 KB? We should also consider changing this: https://github.com/argoproj/argo-workflows/blob/4db1c4c8495d0b8e13c718207175273fe98555a2/workflow/executor/executor.go#L761-L766
@chazapis Any thoughts about https://github.com/argoproj/argo-workflows/commit/cecc379ce23e708479e4253bbbf14f7907272c9c causing a backwards compatibility change?
That is indeed a current version, so yes, please open a new issue. Please include a reproduction and the error message you’re getting.
Could we possibly reopen this? The issue still exists. We’re trying to pass an (admittedly large) JSON blob of initialization values for a map-reduce workflow into a script template, and it fails because of this.
@alexec I think the suggested patch will not work, because ARG_MAX is the total size of all environment variables and arguments, whereas we originally thought it was the maximum length of a single variable. I think the only option is to finish the move to distroless, which puts us back on Debian.
Could you instead install the script as a file using artifacts and run that?
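A minimal sketch of that workaround, assuming the script can be hosted somewhere the cluster can reach (the URL, image, and names below are placeholders): pull the script in as an input artifact and run it from a file, so the template itself, and therefore the ARGO_TEMPLATE environment variable, stays small.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: large-script-from-artifact-
spec:
  entrypoint: run-script
  templates:
    - name: run-script
      inputs:
        artifacts:
          - name: big-script
            path: /mnt/big_script.py                 # where the executor places the file
            http:
              url: https://example.com/big_script.py # hypothetical location of the script
      container:
        image: python:3.9
        command: [python, /mnt/big_script.py]
```

As far as I can tell, a raw artifact would not help here, since its data is inlined in the template and would presumably still end up in ARGO_TEMPLATE.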