argo-workflows: Artifacts not set in Status.Nodes | ArtifactGC prevents Workflow deletion

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when tested with :latest
  • I’d like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

This issue stems from #10840. I expect to be able to delete a workflow and have the garbage collection remove the artifact from the repository.

In this specific case, the workflow finalizer isn’t removed from the workflow due to a failure here, implying that the artifacts aren’t being written to Status.Nodes. Upon inspecting the workflow manifest’s Status.Nodes, I can confirm that no output artifacts are listed. Artifacts are successfully being written to GCS upon triggering “stop” for the workflow.

Steps to Reproduce

  1. Have a workflow with a long-running process
  2. Stop the workflow mid-execution
  3. Artifacts are written to GCS
  4. Attempt to delete the workflow
  5. Notice that the workflow cannot be deleted since the finalizer can’t be removed & the artifacts are not deleted.

Partial Remedy: While preparing this ticket, I noticed that by adding at least one successful step and artifact to my DAG (in parallel), the workflow’s finalizer is successfully removed and the successful DAG step’s artifacts are also cleaned up; however, the failed (stopped) DAG step’s artifacts are not cleaned up. This can leave dangling resources in GCS.

Suggested Fix: Ensure that stopped/failed workflow steps’ output artifacts are written to Status.Nodes.

Version

3.5.0-rc1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don’t enter a workflow that uses private images.

# https://argoproj.github.io/argo-workflows/cron-workflows/
# https://argoproj.github.io/argo-workflows/fields/
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: temp
spec:
  # Start 5 minutes before 6 AM to ensure the system is online.
  schedule: "55 5 * * *"
  timezone: America/New_York
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  # Run the workflow for 16 hours, 10 minutes (between 5:55 AM and 10:05 PM)
  activeDeadlineSeconds: 58200
  workflowSpec:
    entrypoint: test-dag
    templates:
      - name: test-dag
        dag:
          tasks:
            - name: test
              template: test
              continueOn:
                failed: true
    # Clean up artifacts https://argoproj.github.io/argo-workflows/walk-through/artifacts/#artifact-garbage-collection
    artifactGC:
      strategy: OnWorkflowDeletion
      serviceAccountName: argo-workflows-sa
      forceFinalizerRemoval: true
    # How long to keep workflows around after they have completed https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
    ttlStrategy:
      # 5 minutes
      secondsAfterCompletion: 300
      secondsAfterSuccess: 300
      secondsAfterFailure: 300
    # a temporary volume, named logs, will be used as a working directory for this workflow. This volume is passed
    # around from step to step.
    volumeClaimTemplates:
      - metadata:
          name: logs
        spec:
          accessModes: [ "ReadWriteMany" ]
          resources:
            requests:
              storage: 2.5G
    serviceAccountName: argo-workflows-sa
    templates:
      - name: test
        volumes:
          - name: logs
            emptyDir: { }
        container:
          name: test
          image: alpine:latest
          command:
            - sh
            - '-c'
          args:
            - |
              echo 'testing' > /tmp/test.txt
              while true
              do
                # loop indefinitely
                sleep 1
              done
        outputs:
          artifacts:
            - name: 'test'
              path: /tmp/test.txt
              archive:
                none: { }
              gcs:
                # https://argoproj.github.io/argo-workflows/variables/
                key: "{{workflow.name}}-{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}-{{workflow.creationTimestamp.d}}_{{workflow.creationTimestamp.H}}-{{workflow.creationTimestamp.M}}-{{workflow.creationTimestamp.S}}/test.txt"

Logs from the workflow controller

time="2023-09-24T06:32:14.100Z" level=debug msg="processing Artifact GC Strategy OnWorkflowDeletion" namespace=pipeline-dev workflow=pipeline-cron-workflow-dev-9thnn
time="2023-09-24T06:32:14.100Z" level=debug msg="processing Artifact GC Strategy OnWorkflowDeletion" namespace=pipeline-dev workflow=pipeline-cron-workflow-dev-9thnn
time="2023-09-24T06:32:14.100Z" level=debug msg="No Artifact Search Results returned from strategy OnWorkflowDeletion" namespace=pipeline-dev workflow=pipeline-cron-workflow-dev-9thnn
time="2023-09-24T06:32:14.101Z" level=debug msg="Log changes patch: {\"status\":{\"artifactGCStatus\":{\"strategiesProcessed\":{\"OnWorkflowDeletion\":true}}}}"
time="2023-09-24T06:32:14.101Z" level=info msg="Workflow to be dehydrated" Workflow Size=4845
time="2023-09-24T06:32:14.113Z" level=debug msg="Update workflows 200"
time="2023-09-24T06:32:14.114Z" level=info msg="Workflow update successful" namespace=pipeline-dev phase=Failed resourceVersion=121464951 workflow=pipeline-cron-workflow-dev-9thnn

Logs from in your workflow’s wait container

time="2023-09-24T06:29:25.935Z" level=info msg="Starting Workflow Executor" version=v3.5.0-rc1
time="2023-09-24T06:29:25.941Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-24T06:29:25.941Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=pipeline-dev podName=pipeline-cron-workflow-dev-9thnn-test-2481151211 template="{\"name\":\"test\",\"inputs\":{},\"outputs\":{\"artifacts\":[{\"name\":\"test\",\"path\":\"/tmp/test.txt\",\"gcs\":{\"key\":\"pipeline-cron-workflow-dev-9thnn-2023-09-24_06-28-19/test.txt\"},\"archive\":{\"none\":{}}}]},\"metadata\":{},\"container\":{\"name\":\"test\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"echo 'testing' \\u003e /tmp/test.txt\\nwhile true\\ndo\\n  # loop indefinitely\\n  sleep 1\\ndone\\n\"],\"resources\":{}},\"volumes\":[{\"name\":\"logs\",\"emptyDir\":{}}],\"archiveLocation\":{\"archiveLogs\":false,\"gcs\":{\"bucket\":\"project-workflow-artifacts\",\"key\":\"pipeline-cron-workflow-dev-9thnn/pipeline-cron-workflow-dev-9thnn-test-2481151211\"}}}" version="&Version{Version:v3.5.0-rc1,BuildDate:2023-08-15T18:44:17Z,GitCommit:6cd3e810e9c953e0a78070f41ab8f7d799fe51d3,GitTag:v3.5.0-rc1,GitTreeState:clean,GoVersion:go1.20.7,Compiler:gc,Platform:linux/amd64,}"
time="2023-09-24T06:29:25.942Z" level=info msg="Starting deadline monitor"
time="2023-09-24T06:31:19.023Z" level=info msg="Main container completed" error="<nil>"
time="2023-09-24T06:31:19.040Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-09-24T06:31:19.040Z" level=info msg="No output parameters"
time="2023-09-24T06:31:19.040Z" level=info msg="Saving output artifacts"
time="2023-09-24T06:31:19.045Z" level=info msg="Deadline monitor stopped"
time="2023-09-24T06:31:19.050Z" level=info msg="Staging artifact: test"
time="2023-09-24T06:31:19.052Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-09-24T06:31:19.058Z" level=info msg="Copying /tmp/test.txt from container base image layer to /tmp/argo/outputs/artifacts/test.tgz"
time="2023-09-24T06:31:19.065Z" level=info msg="/var/run/argo/outputs/artifacts/tmp/test.txt.tgz -> /tmp/argo/outputs/artifacts/test.tgz"
time="2023-09-24T06:31:19.080Z" level=info msg="Untaring /tmp/argo/outputs/artifacts/test.tgz archive before upload"
time="2023-09-24T06:31:19.105Z" level=info msg="GCS Save path: /tmp/argo/outputs/artifacts/test.txt, key: pipeline-cron-workflow-dev-9thnn-2023-09-24_06-28-19/test.txt"
time="2023-09-24T06:31:19.773Z" level=info msg="Save artifact" artifactName=test duration=668.368403ms error="<nil>" key=pipeline-cron-workflow-dev-9thnn-2023-09-24_06-28-19/test.txt
time="2023-09-24T06:31:19.773Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/artifacts/test.txt
time="2023-09-24T06:31:19.773Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/test.txt"
time="2023-09-24T06:31:19.795Z" level=debug msg="Create workflowtaskresults 201"
time="2023-09-24T06:31:19.798Z" level=info msg="Alloc=26019 TotalAlloc=37269 Sys=47741 NumGC=5 Goroutines=10"

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 30 (30 by maintainers)

Most upvoted comments

Ok, I may have figured it out! @agilgur5 @juliev0

First, I figured out how artifacts are communicated to the controller. NodeResult is used in the executor’s reportResult() function to create a WorkflowTaskResult resource. This resource is polled by the controller in taskResultReconciliation().

It turns out that the NodeResult upserted by the workflow executor has all the expected artifact information present (both on-deletion-wf-stopped and main-logs), but logs from the taskResultReconciliation() function indicate that only the main-logs artifacts were received by the controller.

After snooping around the controller, I noticed the section:

if woc.wf.Labels[common.LabelKeyCompleted] == "true" { // abort now, we do not want to perform any more processing on a complete workflow because we could corrupt it
	return
}

When a workflow is stopped, the label is set to workflows.argoproj.io/completed: 'true'. Since this line occurs before the task reconciliation, it prevents taskResultReconciliation() from updating Status.Nodes with the artifacts.

I’ve moved the abort to after the reconciliation, and it seems to have worked. Going to mess around with it some more to make sure everything is working as expected.
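
To make the ordering concrete, here’s a minimal, self-contained Go sketch (toy types only, not the real controller code; the label key and the reconcile-then-abort ordering mirror the snippets above):

package main

// Minimal, self-contained sketch (toy types, not the actual controller code)
// of why the ordering matters: if the "completed" label check runs before the
// results reported by the executor (via WorkflowTaskResult) are folded into
// the node map, a stopped workflow's output artifacts never reach
// Status.Nodes, so ArtifactGC later finds nothing to delete.

import "fmt"

type node struct {
	Name      string
	Artifacts []string // artifact keys, simplified
}

type workflow struct {
	Labels map[string]string
	Nodes  map[string]node
}

// reconcile stands in for taskResultReconciliation(): fold reported artifacts
// into the workflow's node map.
func reconcile(wf *workflow, reported map[string][]string) {
	for name, arts := range reported {
		n := wf.Nodes[name]
		n.Artifacts = arts
		wf.Nodes[name] = n
	}
}

// operate stands in for the controller's processing loop.
func operate(wf *workflow, reported map[string][]string, abortBeforeReconcile bool) {
	completed := wf.Labels["workflows.argoproj.io/completed"] == "true"
	if abortBeforeReconcile && completed {
		return // current behaviour: stopped workflows bail out, artifacts are dropped
	}
	reconcile(wf, reported)
	if completed {
		return // proposed behaviour: abort only after reconciling
	}
	// ...remaining processing...
}

func main() {
	reported := map[string][]string{"test": {"some-workflow/test.txt"}}
	labels := map[string]string{"workflows.argoproj.io/completed": "true"}

	current := &workflow{Labels: labels, Nodes: map[string]node{"test": {Name: "test"}}}
	operate(current, reported, true)
	fmt.Println("abort first:    ", current.Nodes["test"].Artifacts) // []

	proposed := &workflow{Labels: labels, Nodes: map[string]node{"test": {Name: "test"}}}
	operate(proposed, reported, false)
	fmt.Println("reconcile first:", proposed.Nodes["test"].Artifacts) // [some-workflow/test.txt]
}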


In terms of the transient level=fatal msg="open /var/run/argo/outputs/artifacts/tmp/test.txt.tgz: no such file or directory", I think this may have been an issue with the timing of the stop workflow call in the test case. Will test this out more, too.

Hopefully I’ll have a PR up soon.

Great, thank you. Have you seen this page? https://argoproj.github.io/argo-workflows/running-locally/

If you have questions, let us know.

This specific log level issue, maybe? Tbh, I have been super confused by the logs from kit so I haven’t been relying on them much. For the longest time I thought that the numbers were for viewing a specific task’s logs – literally earlier this week I realized I had misread it 😖 kit muxing the logs together is not what I expected.

How do you mean “muxing”? The individual logs are in the logs directory. So, I’ve basically been tailing logs/controller.log. Although, it’s not obvious to look there, so I was confused at first.

It’s been mentioned a few times to replace kit. It is non-standard and Alex hasn’t actively worked on it since its release; Workflows seems like the only user of it? In go-land, Taskfile is the popular tool. Also these things are doable within make itself. I’m not the biggest fan of make (mask is a more readable alternative and Earthly is my current go-to that I mentioned on Slack), but having two task runners adds another layer 😕

Yes, I would agree. It doesn’t seem like he’s working on it anyway, and now that he’s not a lead he may not even object.

@agilgur5 or @juliev0, could either of you provide any insights into why I might not be seeing debug logs? It’s making it hard to debug.

Sorry. This is kind of hacky but what I did for this was change this line locally to:

command: ./dist/workflow-controller --loglevel debug

There’s a bug in this project where it’s not using the ARGO_LOG_LEVEL environment variable. I mentioned it a while back to Alex but nothing happened.

(I know this is not well publicized.) Have you run into this as well @agilgur5 ? I imagine it must be frustrating a lot of people.

@juliev0 & @agilgur5

WIP update

Findings

As I work through this, I’m noticing the issue seems to be related only to stopping the workflow (a failure alone doesn’t do the trick). I’ve got a test case that reproduces the error, so now I’m implementing the fix. Since there are no test cases for a step failure with GC, I’ll implement one for that too, to be sure.

My contribution experience + blocks

Jetbrains Dev Container

A note in the setup docs to avoid using Dev Container with Jetbrains IDEs might be nice, since the Jetbrains DevContainer plugin doesn’t support features in the devcontainer.json yet; these are used to install Go, Node, docker-in-docker, and Python in the container. See this ticket for tracking of that issue. I wasted a bunch of time trying to figure this out since I initially didn’t read the plugin notes fully.

Dev Container argoexec-image failure loop

Once I finally got the Dev Container working on VS Code, I still ran into an issue where the container would boot up but require a restart of VS Code to enter the environment. It would otherwise enter a loop where one step would continually fail.

executor: ERROR: failed to solve: process "/bin/sh -c make dist/argoexec GIT_COMMIT=${GIT_COMMIT} GIT_TAG=${GIT_TAG} GIT_TREE_STATE=${GIT_TREE_STATE}" did not complete successfully: exit code: 2
executor: make: *** [Makefile:244: argoexec-image] Error 1

Once I restart VS Code, it seems to work fine, but I’m not certain that this error has no consequences…

Multiple terminals to segregate UI and Controller

I don’t think using one terminal for the controller + server and another for the UI works… I’m not seeing any reference to CTRL in the Makefile. Running the command for starting the UI alone (make start CTRL=false) fails since it tries to boot up the controller again (which is already running).

No debug logs for workflow controller

It seems that there are no loglevel=debug logs for the workflow controller, as my Dev Container debug logs (enabled by 4 + Enter) don’t show the workflow controller’s debug output.

When I add

args:
    - --loglevel=debug

to test/e2e/manifests/mixins/workflow-controller-deployment.yaml, debug logs still don’t show.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
spec:
  replicas: 0
  template:
    spec:
      containers:
        - name: workflow-controller
          imagePullPolicy: Never
          args:
            - --loglevel=debug

No, it’s in draft since there are test failures. I don’t think the failures are to do with my PR though.

Oh, okay. Yeah, in that case you can take it out of “draft”. There are unfortunately intermittent test failures with some of the e2e tests.

I was wrong with respect to the intermittent errors & artifacts not uploading. When the intermittent level=fatal msg="open /var/run/argo/outputs/artifacts/tmp/test.txt.tgz: no such file or directory" error occurs, the artifacts do not upload.

Even so, when the artifacts do get uploaded, they do not get garbage collected.

This issue feels very much like a race condition. I’m having a lot of trouble figuring it out. I’m not familiar with Go & don’t have a good grasp of the architecture. I know that the workflow holds the state, but I’m finding it difficult to trace.

Quick update: it turns out that outputs aren’t present in taskNode for stopped tasks. The problem is upstream somewhere. I’ll likely still have to implement the fix mentioned above, but it may only be part of the solution.

controller: time="2023-09-29T17:08:16.606Z" level=debug msg="taskNode: &NodeStatus{ID:artgc-dag-wf-stopped-nnb28-1981691075,Name:artgc-dag-wf-stopped-nnb28.create-artifact,DisplayName:create-artifact,Type:Pod,TemplateName:artgc-dag-artifact-creator,TemplateRef:nil,Phase:Failed,BoundaryID:artgc-dag-wf-stopped-nnb28,Message:workflow shutdown with strategy:  Stop,StartedAt:2023-09-29 17:07:55 +0000 UTC,FinishedAt:2023-09-29 17:08:16.6060254 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[],OutboundNodes:[],TemplateScope:local/artgc-dag-wf-stopped-nnb28,ResourcesDuration:ResourcesDuration{},HostNodeName:k3d-k3s-default-server-0,MemoizationStatus:nil,EstimatedDuration:0,SynchronizationStatus:nil,Progress:0/1,} \n" namespace=argo workflow=artgc-dag-wf-stopped-nnb28

Thanks for your efforts, analysis, & feedback @Garett-MacGowan! ❤️

Jetbrains Dev Container

Yea, devcontainers are still pretty new, though I am surprised that Jetbrains doesn’t just immediately throw an error or warning when it detects features, which it apparently doesn’t support.

Would you like to submit a PR to add a !!! Warning on Jetbrains use? I’ve used both in VSCode and with the CLI, which are both explicitly mentioned in the docs.

Nowadays I just use the devcontainers CLI so that it doesn’t affect my editor. I’m pretty used to developing with Docker/Compose CLI and headless Vagrant, so that was pretty familiar to me.

Dev Container argoexec-image failure loop

I have seen it occasionally fail on random steps, and indeed restarting does the trick. Not really clear why it’s failing unfortunately 😕

Multiple terminals to segregate UI and Controller

Oh, good find! I’ve never tried that before. I think this section probably pre-dates the usage of kit in the build system. That being said, there have been some efforts to replace it with something more standard. Should probably remove that section if it’s straight-up incorrect right now – can put that in the same docs PR if you’d like to do that!

@juliev0 I’ve seen that page, thanks. I will work through dev setup tomorrow.

@agilgur5 Thanks! Let me know if your look around results in anything I should know.

If you see no problems with it, I’ll implement a change so that steps and DAG update the workflow’s status.nodes outputs irrespective of wfv1.NodeError, wfv1.NodeFailed.


That’s good investigation. I don’t see any reason why Outputs shouldn’t be set in that case. @agilgur5 do you see any reason?

@Garett-MacGowan would you be able to try seeing if you can fix this? I don’t have much bandwidth and really haven’t worked myself with this particular part of the code.

@juliev0 I took a peek at the source code. I have a hypothesis that I’m hoping you can shine some light on. Perhaps it can expedite your investigation when you get around to it.

My best guess is that the code execution never gets to this line here in workflow/controller/dag.go. I think this segment is responsible for setting the artifacts in status.nodes.

I think the execution never reaches the above, perhaps due to this line here in workflow/controller/dag.go, which returns early because of case wfv1.NodeError, wfv1.NodeFailed:. After inspecting the workflow’s status.nodes.node.phase, it seems that wfv1.NodeFailed is set when the workflow is stopped.

The solution may be to set outputs in status.nodes irrespective of case wfv1.NodeError, wfv1.NodeFailed:. Is there any reason why this shouldn’t be done?

The same issue seems to exist in workflow/controller/steps.go here.
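
To illustrate the shape of that proposal (a toy model only – the real switch statements in dag.go/steps.go are more involved), the idea is to stop short-circuiting on Failed/Error before the node’s outputs have been recorded:

package main

// Toy model only (not the actual dag.go/steps.go code): the proposal is to
// record a node's outputs in Status.Nodes even when its phase is Failed or
// Error, rather than returning from the phase switch before the outputs are
// kept.

import "fmt"

type phase string

const (
	failed  phase = "Failed"
	errored phase = "Error"
)

type node struct {
	Phase     phase
	Artifacts []string // artifact keys, simplified
}

// recordCurrent mimics the current behaviour: failed/errored nodes
// short-circuit before their outputs are kept.
func recordCurrent(status map[string]node, id string, n node) {
	switch n.Phase {
	case failed, errored:
		return // outputs dropped, so ArtifactGC never sees them
	}
	status[id] = n
}

// recordProposed keeps the outputs regardless of phase; failure handling
// would then happen afterwards.
func recordProposed(status map[string]node, id string, n node) {
	status[id] = n
}

func main() {
	stopped := node{Phase: failed, Artifacts: []string{"some-workflow/test.txt"}}

	current, proposed := map[string]node{}, map[string]node{}
	recordCurrent(current, "test", stopped)
	recordProposed(proposed, "test", stopped)
	fmt.Println("current: ", current)  // map[]
	fmt.Println("proposed:", proposed) // map[test:{Failed [some-workflow/test.txt]}]
}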

Thanks for the help!

Okay, ultimately the root cause is outside of the ArtifactGC logic itself, but nonetheless I can try to look at this when I get time.

As promised, here’s the new ticket dedicated to my issue @juliev0