argo-workflows: Artifacts not set in Status.Nodes | ArtifactGC prevents Workflow deletion
Pre-requisites
- I have double-checked my configuration
- I can confirm the issue exists when I tested with `:latest`
- I’d like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
This issue stems from #10840. I expect to be able to delete a workflow and have the garbage collection remove the artifact from the repository.
In this specific case, the workflow finalizer isn't removed from the workflow due to a failure here, implying that the artifacts aren't being written to `Status.Nodes`. Upon inspecting the workflow manifest's `Status.Nodes`, I can confirm that no output artifacts are listed. Artifacts are successfully written to GCS upon triggering "stop" for the workflow.
Steps to Reproduce
- Have a workflow with a long-running process
- Stop the workflow mid-execution
- Artifacts are written to GCS
- Attempt to delete the workflow
- Notice that the workflow cannot be deleted since the finalizer can’t be removed & the artifacts are not deleted.
Partial Remedy
While preparing this ticket, I noticed that by adding at least one successful step and artifact to my DAG (in parallel), the workflow's finalizer is successfully removed and the successful DAG step's artifacts are also successfully cleaned up. However, the failed (stopped) DAG step's artifacts are not cleaned up, which can leave dangling resources in GCS.
Suggested Fix
Ensure that stopped/failed workflow steps' output artifacts are written to `Status.Nodes`.
Version
3.5.0-rc1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
# https://argoproj.github.io/argo-workflows/cron-workflows/
# https://argoproj.github.io/argo-workflows/fields/
```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: temp
spec:
  # Start 5 minutes before 6 AM to ensure the system is online.
  schedule: "55 5 * * *"
  timezone: America/New_York
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  # Run the workflow for 16 hours, 10 minutes (between 5:55 AM and 10:05 PM)
  activeDeadlineSeconds: 58200
  workflowSpec:
    entrypoint: test-dag
    serviceAccountName: argo-workflows-sa
    # Clean up artifacts https://argoproj.github.io/argo-workflows/walk-through/artifacts/#artifact-garbage-collection
    artifactGC:
      strategy: OnWorkflowDeletion
      serviceAccountName: argo-workflows-sa
      forceFinalizerRemoval: true
    # How long to keep workflows around after they have completed https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
    ttlStrategy:
      # 5 minutes
      secondsAfterCompletion: 300
      secondsAfterSuccess: 300
      secondsAfterFailure: 300
    # A temporary volume, named logs, is used as a working directory for this workflow and is passed
    # around from step to step.
    volumeClaimTemplates:
      - metadata:
          name: logs
        spec:
          accessModes: [ "ReadWriteMany" ]
          resources:
            requests:
              storage: 2.5G
    templates:
      - name: test-dag
        dag:
          tasks:
            - name: test
              template: test
              continueOn:
                failed: true
      - name: test
        volumes:
          - name: logs
            emptyDir: { }
        container:
          name: test
          image: alpine:latest
          command:
            - sh
            - '-c'
          args:
            - |
              echo 'testing' > /tmp/test.txt
              while true
              do
                # loop indefinitely
                sleep 1
              done
        outputs:
          artifacts:
            - name: 'test'
              path: /tmp/test.txt
              archive:
                none: { }
              gcs:
                # https://argoproj.github.io/argo-workflows/variables/
                key: "{{workflow.name}}-{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}_{{workflow.creationTimestamp.H}}-{{workflow.creationTimestamp.M}}-{{workflow.creationTimestamp.S}}/test.txt"
```
Logs from the workflow controller
time="2023-09-24T06:32:14.100Z" level=debug msg="processing Artifact GC Strategy OnWorkflowDeletion" namespace=pipeline-dev workflow=pipeline-cron-workflow-dev-9thnn
time="2023-09-24T06:32:14.100Z" level=debug msg="processing Artifact GC Strategy OnWorkflowDeletion" namespace=pipeline-dev workflow=pipeline-cron-workflow-dev-9thnn
time="2023-09-24T06:32:14.100Z" level=debug msg="No Artifact Search Results returned from strategy OnWorkflowDeletion" namespace=pipeline-dev workflow=pipeline-cron-workflow-dev-9thnn
time="2023-09-24T06:32:14.101Z" level=debug msg="Log changes patch: {\"status\":{\"artifactGCStatus\":{\"strategiesProcessed\":{\"OnWorkflowDeletion\":true}}}}"
time="2023-09-24T06:32:14.101Z" level=info msg="Workflow to be dehydrated" Workflow Size=4845
time="2023-09-24T06:32:14.113Z" level=debug msg="Update workflows 200"
time="2023-09-24T06:32:14.114Z" level=info msg="Workflow update successful" namespace=pipeline-dev phase=Failed resourceVersion=121464951 workflow=pipeline-cron-workflow-dev-9thnn
Logs from in your workflow’s wait container
time="2023-09-24T06:29:25.935Z" level=info msg="Starting Workflow Executor" version=v3.5.0-rc1
time="2023-09-24T06:29:25.941Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-24T06:29:25.941Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=pipeline-dev podName=pipeline-cron-workflow-dev-9thnn-test-2481151211 template="{\"name\":\"test\",\"inputs\":{},\"outputs\":{\"artifacts\":[{\"name\":\"test\",\"path\":\"/tmp/test.txt\",\"gcs\":{\"key\":\"pipeline-cron-workflow-dev-9thnn-2023-09-24_06-28-19/test.txt\"},\"archive\":{\"none\":{}}}]},\"metadata\":{},\"container\":{\"name\":\"test\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"echo 'testing' \\u003e /tmp/test.txt\\nwhile true\\ndo\\n # loop indefinitely\\n sleep 1\\ndone\\n\"],\"resources\":{}},\"volumes\":[{\"name\":\"logs\",\"emptyDir\":{}}],\"archiveLocation\":{\"archiveLogs\":false,\"gcs\":{\"bucket\":\"project-workflow-artifacts\",\"key\":\"pipeline-cron-workflow-dev-9thnn/pipeline-cron-workflow-dev-9thnn-test-2481151211\"}}}" version="&Version{Version:v3.5.0-rc1,BuildDate:2023-08-15T18:44:17Z,GitCommit:6cd3e810e9c953e0a78070f41ab8f7d799fe51d3,GitTag:v3.5.0-rc1,GitTreeState:clean,GoVersion:go1.20.7,Compiler:gc,Platform:linux/amd64,}"
time="2023-09-24T06:29:25.942Z" level=info msg="Starting deadline monitor"
time="2023-09-24T06:31:19.023Z" level=info msg="Main container completed" error="<nil>"
time="2023-09-24T06:31:19.040Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-09-24T06:31:19.040Z" level=info msg="No output parameters"
time="2023-09-24T06:31:19.040Z" level=info msg="Saving output artifacts"
time="2023-09-24T06:31:19.045Z" level=info msg="Deadline monitor stopped"
time="2023-09-24T06:31:19.050Z" level=info msg="Staging artifact: test"
time="2023-09-24T06:31:19.052Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-09-24T06:31:19.058Z" level=info msg="Copying /tmp/test.txt from container base image layer to /tmp/argo/outputs/artifacts/test.tgz"
time="2023-09-24T06:31:19.065Z" level=info msg="/var/run/argo/outputs/artifacts/tmp/test.txt.tgz -> /tmp/argo/outputs/artifacts/test.tgz"
time="2023-09-24T06:31:19.080Z" level=info msg="Untaring /tmp/argo/outputs/artifacts/test.tgz archive before upload"
time="2023-09-24T06:31:19.105Z" level=info msg="GCS Save path: /tmp/argo/outputs/artifacts/test.txt, key: pipeline-cron-workflow-dev-9thnn-2023-09-24_06-28-19/test.txt"
time="2023-09-24T06:31:19.773Z" level=info msg="Save artifact" artifactName=test duration=668.368403ms error="<nil>" key=pipeline-cron-workflow-dev-9thnn-2023-09-24_06-28-19/test.txt
time="2023-09-24T06:31:19.773Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/artifacts/test.txt
time="2023-09-24T06:31:19.773Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/test.txt"
time="2023-09-24T06:31:19.795Z" level=debug msg="Create workflowtaskresults 201"
time="2023-09-24T06:31:19.798Z" level=info msg="Alloc=26019 TotalAlloc=37269 Sys=47741 NumGC=5 Goroutines=10"
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 30 (30 by maintainers)
Commits related to this issue
- test: Add failing e2e test for stopped DAG workflow ArtifactGC. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 9 months ago
- test: Improve robustness of stopped DAG workflow ArtifactGC test. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 9 months ago
- test: Move task reconciliation in operator.go to before garbage collection to ensure that stopped workflow artifacts are garbage collected. Remove bug in e2e testing WaitForWorkflow() timeout option i... — committed to GarettSoftware/argo-workflows by Garett-MacGowan 9 months ago
- test: Add new E2E_WAIT_TIMEOUT_BIAS environment varible to enable a bias for test timeouts in slow environments. Adjust WaitForWorkflow timeout parameter in retry_test.go (added 30 seconds) to be in l... — committed to GarettSoftware/argo-workflows by Garett-MacGowan 9 months ago
- test: Minor refactors to TestStoppedWorkflow ArtifactGC test. Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 9 months ago
- test: Adjust TestStoppedWorkflow test to prevent potential unwanted workflow deletions during testing (in case tests are running in parallel). Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftw... — committed to GarettSoftware/argo-workflows by Garett-MacGowan 9 months ago
- fix: Add new WorkflowTaskResult label to prevent race condition in ArtifactGC. Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Add new WorkflowTaskResult label to prevent race condition in ArtifactGC. Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- test: Fix ArtifactGC tests. Add tests for refactor. Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Add state to ArtGCStatus to keep track of completed task results. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Undo context rework in wait container since it did not have the desired effect. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Move TaskResultsCompleted from ArtGCStatus to WorkflowStatus. Simplify LabelKeyReportOutputsCompleted label setting logic. Add support for legacy/insecure pod patch for ReportOutputsCompleted ann... — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Refactors to address PR comments. Add context.Background() to wait container defer methods to ensure they complete. Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- test: Fix ReportOutputs test. Fixes #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Run linter. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- test: Make TestMetricGC more robust. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- test: rerun tests. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- test: rerun tests. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- fix: Move ArtifactGC to before workflow level synchronization lock. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
- test: rerun tests. Towards #11879 Signed-off-by: Garett MacGowan <garettsoftware@gmail.com> — committed to GarettSoftware/argo-workflows by Garett-MacGowan 8 months ago
Ok, I may have figured it out! @agilgur5 @juliev0
First, I figured out how artifacts are communicated to the controller.
`NodeResult` is used in the executor's `reportResult()` function to create a `WorkflowTaskResult` resource. This resource is polled by the controller in `taskResultReconciliation()`.

It turns out that the `NodeResult` upserted by the workflow executor has all the expected artifact information present (both `on-deletion-wf-stopped` and `main-logs`), but logs from the `taskResultReconciliation()` function indicate that only the `main-logs` artifacts were received by the controller.

After snooping around the controller, I noticed the section: when a workflow is stopped, the label is set to `workflows.argoproj.io/completed: 'true'`. Since this line occurs before the task reconciliation, it prevents `taskResultReconciliation()` from updating `Status.Nodes` with the artifacts. I've moved the abort to after the reconciliation, and it seems to have worked. Going to mess around with it some more to make sure everything is working as expected.

In terms of the transient `level=fatal msg="open /var/run/argo/outputs/artifacts/tmp/test.txt.tgz: no such file or directory"`, I think this may have been an issue with the timing of the `stop workflow` call in the test case. Will test this out more, too.

Hopefully I'll have a PR up soon.
Great, thank you. Have you seen this page? https://argoproj.github.io/argo-workflows/running-locally/
If you have questions, let us know.
How do you mean "muxing"? The individual logs are in the `logs` directory. So, I've basically been tailing `logs/controller.log`. It's not obvious to look there, though, so I was confused at first.

Yes, I would agree. It doesn't seem like he's working on it anyway, and now that he's not a lead he may not even object.
Sorry. This is kind of hacky, but what I did for this was change this line locally to:

There's a bug in this project in that it's not using the `ARGO_LOG_LEVEL` environment variable. I mentioned it a while back to Alex but nothing happened. (I know this is not well publicized.) Have you run into this as well, @agilgur5? I imagine it must be frustrating a lot of people.
@juliev0 & @agilgur5
WIP update
Findings
As I work through this, I'm noticing the issue seems to be related only to stopping the workflow (a failure alone doesn't do the trick). I've got a test case that reproduces the error, so now I'm implementing the fix. Since there are no test cases for a step failure with GC, I'll implement one for that too, to be sure.
My contribution experience + blocks
Jetbrains Dev Container
A note in the setup docs to avoid using Dev Container with JetBrains IDEs might be nice, since the JetBrains Dev Container plugin doesn't support `features` in the `devcontainer.json` yet. These are used to install Go, Node, docker-in-docker, and Python in the container. See this ticket for tracking of that issue. I wasted a bunch of time trying to figure this out since I initially didn't read the plugin notes fully.

Dev Container argoexec-image failure loop
Once I finally got the Dev Container working in VS Code, I still ran into an issue where the container would boot up but require a restart of VS Code to enter the environment. It would otherwise enter a loop where one step would continually fail.
Once I restarted VS Code, it seemed to work fine, but I'm not certain that this error has no consequences…
Multiple terminals to segregate UI and Controller
I don't think using one terminal for the controller + server and another for the UI works… I'm not seeing any reference to CTRL in the `make` file. When I try to run the command for starting the UI alone (`make start CTRL=false`), it fails since it tries to boot up the controller again (it is already running).

No debug logs for workflow controller
It seems that there is no `loglevel=debug` for the workflow controller, as my Dev Container debug logs (enabled by 4+enter) don't show the workflow debug logs. When I add it to `test/e2e/manifests/mixins/workflow-controller-deployment.yaml`, debug logs still don't show.

Oh, okay. Yeah, in that case you can take it out of "draft". There are unfortunately intermittent test failures with some of the e2e tests.
I was wrong with respect to the intermittent errors & artifacts not uploading. When the intermittent `level=fatal msg="open /var/run/argo/outputs/artifacts/tmp/test.txt.tgz: no such file or directory"` error occurs, the artifacts do not upload. Even so, when the artifacts do get uploaded, they do not get garbage collected.
This issue feels very much like a race condition. I'm having a lot of trouble figuring it out. I'm not familiar with Go & don't have a good grasp of the architecture. I know that the workflow holds the state, but I'm finding it difficult to trace.
Quick update: it turns out that outputs aren't present in `taskNode` for stopped tasks. The problem is upstream somewhere. I'll likely still have to implement the fix mentioned above, but it may only be part of the solution.

Thanks for your efforts, analysis, & feedback @Garett-MacGowan! ❤️
Yea, devcontainers are still pretty new, though I am surprised that JetBrains doesn't just immediately throw an error or warning if it detects `features`, which it apparently doesn't support.

Would you like to submit a PR to add a `!!! Warning` on JetBrains use? I've used devcontainers both in VS Code and with the CLI, which are both explicitly mentioned in the docs. Nowadays I just use the devcontainers CLI so that it doesn't affect my editor. I'm pretty used to developing with the Docker/Compose CLI and headless Vagrant, so that was pretty familiar to me.
I have seen it occasionally fail on random steps, and indeed restarting does the trick. Not really clear why it’s failing unfortunately 😕
Oh, good find! I've never tried that before. I think this section probably pre-dates the usage of `kit` in the build system. That being said, there have been some efforts to replace it with something more standard. We should probably remove that section if it's straight-up incorrect right now; you can put that in the same docs PR if you'd like to do that!

@juliev0 I've seen that page, thanks. I will work through the dev setup tomorrow.
@agilgur5 Thanks! Let me know if your look around results in anything I should know.
If you see no problems with it, I'll implement a change so that `steps` and `DAG` update the workflow's `status.nodes` outputs irrespective of `wfv1.NodeError, wfv1.NodeFailed`.
.That’s good investigation. I don’t see any reason why Outputs shouldn’t be set in that case. @agilgur5 do you see any reason?
@Garett-MacGowan would you be able to try seeing if you can fix this? I don’t have much bandwidth and really haven’t worked myself with this particular part of the code.
@juliev0 I took a peek at the source code. I have a hypothesis that I’m hoping you can shine some light on. Perhaps it can expedite your investigation when you get around to it.
My best guess is that the code execution never gets to this line here in `workflow/controller/dag.go`. I think this segment is responsible for setting the artifacts in `status.nodes`.

I think the execution never reaches the above due, perhaps, to this line here in `workflow/controller/dag.go`, which returns early because of `case wfv1.NodeError, wfv1.NodeFailed:`. After inspecting the workflow's `status.nodes.node.phase`, it seems that `wfv1.NodeFailed` is set when the workflow is stopped.

The solution may be to set outputs in `status.nodes` irrespective of `case wfv1.NodeError, wfv1.NodeFailed:`. Is there any reason why this shouldn't be done?

The same issue seems to exist in `workflow/controller/steps.go` here.

Thanks for the help!
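To illustrate the proposed change, here is a self-contained Go sketch. The phases and types below are stand-ins invented for illustration (they are not the real `wfv1` API); the point is only that outputs get propagated for Failed and Error nodes as well as Succeeded ones.

```go
package main

import "fmt"

// Illustrative-only phases and types; these mirror the idea of
// wfv1.NodeError / wfv1.NodeFailed but are not the real argo-workflows API.
type nodePhase string

const (
	nodeSucceeded nodePhase = "Succeeded"
	nodeFailed    nodePhase = "Failed"
	nodeError     nodePhase = "Error"
)

type outputs struct {
	artifacts []string
}

type taskNode struct {
	phase   nodePhase
	outputs *outputs
}

// propagateOutputs sketches the proposed behavior: copy a task node's outputs
// into the surrounding scope for Succeeded, Failed, and Error phases alike,
// so that stopped (Failed) tasks still expose their artifacts for ArtifactGC.
func propagateOutputs(n taskNode, scope map[string]*outputs, key string) {
	switch n.phase {
	case nodeSucceeded, nodeFailed, nodeError:
		if n.outputs != nil {
			scope[key] = n.outputs // previously skipped for Failed/Error nodes
		}
	}
}

func main() {
	scope := map[string]*outputs{}
	stopped := taskNode{phase: nodeFailed, outputs: &outputs{artifacts: []string{"test.txt"}}}
	propagateOutputs(stopped, scope, "tasks.test.outputs")
	fmt.Println(scope["tasks.test.outputs"].artifacts) // [test.txt]
}
```

In the real code, this would correspond to not returning early from the `case wfv1.NodeError, wfv1.NodeFailed:` branch before the outputs are copied into `status.nodes`.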
Okay, ultimately the root cause is outside of the ArtifactGC logic itself, but nonetheless I can try to look at this when I get time.
As promised, here’s the new ticket dedicated to my issue @juliev0