pipeline: Design: Failure Strategy for TaskRuns in a PipelineRun

The goal is to come up with a design to handle failing task runs in a pipelinerun. Today, we simply fail the entire pipelinerun if a single taskrun fails.

Current Status

Summary in this comment: https://github.com/tektoncd/pipeline/issues/1684#issuecomment-611016087

Ideas

Here are a couple of ideas from @sbwsg and me:

  1. Introduce an errorStrategy field in PipelineTasks similar to the idea in #1573
  2. The errorStrategy could be under the runAfter field.
  3. To start off, we could have two error strategies: FailPipeline, which is today's default, and ContinuePipeline, which will continue running the whole pipeline
  4. Later on, we could add branch-based error strategies, e.g. fail on one branch of the graph but continue running the rest of the pipeline

Additional Info

@sbwsg has some strawperson YAMLs:

  • RunNextTasks for an integration test cleanup scenario
  • FailPipeline (default) for a unit test failing before a deploy task

Use Cases

  • Unit tests fail but integration tests still run
  • Rollbacks for CD e.g. Canaries - rollback if canary analysis fails
  • Cleanup task if integration test fails
  • Always run a step/task at the end e.g. to Report results
  • Run on conditional failures #1023

Related Issues

The Epic #1376 has all the related issues

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 4
  • Comments: 56 (33 by maintainers)

Most upvoted comments

One idea - instead of failure/error/executionStrategy, we could have a field like runOn (or simply on or when) that takes in a list of states that the parent taskruns have to be in for it to run (default is: success):

- name: task1
  conditions:
    conditionRef: "condition-that-sometime-fails"
  taskRef: { name: "my-task" }

- name: runIfTask1Fails 
  runAfter: task1
  runOn: ["failure"]

- name: runIfTask1Succeeds
  runAfter: task1
  runOn: ["success"]

- name: runIfTask1IsSkipped
  runAfter: task1
  runOn: ["skip"]

What I like about this is that the field name is more succinct, and instead of having to remember a bunch of magic strings (is it RunOnSuccessOrSkip or RunOnSkipAndSuccess?), the user just needs to remember the 3 taskrun states: “success”, “failure”, “skip”.
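The runOn check described above could be sketched roughly as follows. This is only an illustration, not Tekton code: the function name `should_run` and the dictionary of parent states are hypothetical, and the sketch assumes that when a task has several parents, every parent must be in one of the listed states.

```python
from typing import Dict, List

# The three taskrun states named in the proposal.
VALID_STATES = {"success", "failure", "skip"}


def should_run(run_after: List[str], run_on: List[str],
               parent_states: Dict[str, str]) -> bool:
    """Return True if every parent finished in a state listed in run_on.

    Assumption: an empty run_on falls back to the proposed default,
    ["success"], and all parents in run_after have already finished.
    """
    allowed = set(run_on or ["success"])
    unknown = allowed - VALID_STATES
    if unknown:
        raise ValueError(f"unknown runOn states: {sorted(unknown)}")
    return all(parent_states[parent] in allowed for parent in run_after)


# runIfTask1Fails from the example above: runs only when task1 failed.
print(should_run(["task1"], ["failure"], {"task1": "failure"}))
```

Keeping the states as a plain list means a combination such as "success or skip" needs no dedicated constant.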

A few more examples here: https://gist.github.com/dibyom/92dfd6ea20f13f5c769a21389df53977

Status update: We are currently punting on the runOn syntax. Instead, we are:

  1. Implementing a pipeline level finally field that always runs some tasks at the end of a pipeline. (doc

We are also considering adding the following (discussions ongoing in the API working group):

  1. A pipeline level onError/except that runs a Task if any task in a pipeline fails
  2. A finally/onError Step within a Task or within a PipelineTask
  3. Allowing a pipeline to call other pipelines aka nested/sub pipelines for complex branches

Some discussion here

I feel like it’s fair to consider this closed now that we have finally, though there are more features to add. To get the complete set of flexibility someone might want, I think we need to add #2134 as well.

hey folks - this is one of the key requirements for the work we are leading from Kubeflow side to run on top of Tekton. Would be great to get current status, and see how we can accelerate this

@pritidesai is implementing a finally field for always running tasks at the end of a pipeline. Hope that helps with some of your use cases. Beyond that, we are considering:

  1. A “finally” step with a task
  2. An “onError” step/task that runs only if the pipeline/task fails.
  3. Allowing a pipeline to call other pipelines aka nested/sub pipelines for complex branches

Would this be sufficient for the Kubeflow use cases?

hey folks - this is one of the key requirements for the work we are leading from Kubeflow side to run on top of Tekton. Would be great to get current status, and see how we can accelerate this

cc @tomcli @afrittoli @skaegi

@pritidesai @bigkevmcd @pierretasci Thanks a lot for adding such detailed use cases! Very very helpful 🙏

  • @pritidesai For your use case: the current behavior for conditionals is that if a task is skipped, its dependents (identified using the runAfter and from fields) are automatically skipped. The overall pipelinerun status will be determined from the status of the non-skipped tasks. The default and only strategy today is RunOnSuccess, though I guess there could be strategies such as RunOnSuccessOrSkip or RunAlways which can be combined with conditionals for more complex pipelines.

  • @bigkevmcd Updating status is definitely a very important use case:

    • top level runAfter - this is the pipeline level finally use case. It seems like we’d have to add something like this. The alternative would be to have one task that has runAfters set so that it runs after all other tasks and a strategy set to RunAlways. This can be unwieldy, since any time you add a new Task to the pipeline, you’d have to manually make sure that the task is still the last thing that executes.

    • errorStrategy containing a taskRef - this is interesting! And in some ways more descriptive than adding a generic task with a runAfter and an errorStrategy: RunOnFailure. Are there other benefits? One thing I like about keeping the taskRefs separate is that we can then have multiple tasks that run or are chained together (e.g. you can have both a cleanup-test-env task and an update-github-task that run when the test fails).

    • On passing status to tasks – we had a proposal in https://github.com/tektoncd/pipeline/issues/1020 though the current way of doing so is to pass in the pipelineRun name and then using kubectl within the task to fetch the status. (I think @afrittoli might also be doing something here re: Notifications design work)

  • @pierretasci Sounds like the RunAlways strategy is what you’d need for the upload-test-results-step in your example. I do like the idea of using conditionals as sort of the extension mechanism for more complicated strategies – the basic strategies such as RunAlways, RunOnSuccess/Failure/Skip etc. are built-in while a user can use those plus a conditional to describe complex strategies (e.g. a strategy of RunAlways plus a conditional for if two of the three tasks failed or whatever)

One specific use case that I don’t think has been explicitly mentioned above is in a fan-in/out scenario.

For example, if my pipeline is

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: sharded-tests
spec:
  tasks:
    - name: pre-work
      taskRef:
        name: pre-work-step
    - name: run-tests-shard-1
      taskRef:
        name: golang-test
      params:
        - name: SHARD_SPEC
          value: 1
      runAfter: ["pre-work"]
    - name: run-tests-shard-2
      taskRef:
        name: golang-test
      params:
        - name: SHARD_SPEC
          value: 2
      runAfter: ["pre-work"]
    - name: upload-test-results
      taskRef:
        name: upload-test-results-step
      runAfter: ["run-tests-shard-1", "run-tests-shard-2"]

Here, I would want to always run the upload-test-results task regardless of whether 0, 1, or both of the tasks preceding it failed.

To me, this reads a lot like conditional execution but more like, conditional failure. Perhaps, this could be served as an extension to the conditions that already exist. If you wanted to “always execute B after A” your condition could simply always return true to override the default behavior of “execute B after A if A is successful”

Another slightly different case:

For things like updating GitHub status notifications it would be nice if we could do something like the following…admittedly this is a bit repetitive, but passing the “success” or “failure” of a task might work with the “recover” strategy mentioned earlier, which would mean that after each task, somehow it’d use the success/failure of the previous task to update the GitHub status appropriately.

Updating these kinds of statuses would be really useful if you want your pipeline to determine whether or not a commit can be merged (if you’re not familiar with these, you can require specific contexts to be successful before a PR can be merged).

This also adds a runAfter pipeline-scoped taskRef, which could do the cleanup in a “Go defer” way, i.e. always after the pipeline has ended, irrespective of what caused it to end.

The example below would trigger two parallel executions (lint and tests), which would report in their status to GitHub.

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: pullrequest-pipeline
spec:
  runAfter:
    taskRef: cleanup-post-pullrequest
  tasks:
    - name: start-github-ci-status
      taskRef:
        name: update-github-status
      # params belong to the pipeline task, not to taskRef
      params:
        - name: STATUS
          value: pending
        - name: CONTEXT
          value: ci-tests
        - name: COMMIT_SHA
          value: $(inputs.params.commit_sha)
    - name: run-tests
      taskRef:
        name: golang-test
      errorStrategy:
        taskRef: update-commit-status
        params:
        - name: STATUS
          value: failed
        - name: CONTEXT
          value: ci-tests
        - name: COMMIT_SHA
          value: $(inputs.params.commit_sha)
    - name: mark-github-ci-status-success
      runAfter:
        - run-tests
      taskRef:
        name: update-github-status
      # params belong to the pipeline task, not to taskRef
      params:
        - name: STATUS
          value: success
        - name: CONTEXT
          value: ci-tests
        - name: COMMIT_SHA
          value: $(inputs.params.commit_sha)
    # repeat pending for ci-lint context
    - name: run-lint
      taskRef:
        name: golangci-lint
      errorStrategy:
        taskRef: update-commit-status
        params:
        - name: STATUS
          value: failed
        - name: CONTEXT
          value: ci-lint
        - name: COMMIT_SHA
          value: $(inputs.params.commit_sha)
    # repeat success for ci-lint context

defer, recover, and skip sound great, but at the same time will need a little bit of clarification

I agree, the keywords don’t make much sense in isolation. How about “AlwaysRun” (defer), “RunOnFail” (recover), and “RunOnSuccess” (Tekton’s current behaviour)?

I am trying to justify “will not run because integration-tests never run”. In Go, my understanding of the defer statement is that it pushes a function call onto a list, and that list of calls is executed after the surrounding function returns. How would this impact tasks defined after integration-tests?

I think the analogy here with Go’s defer breaks down. I somewhat regret drawing the comparison. In my mind the strategy only describes a single relationship between a task and its “parents” (those it declares with “runAfter” or “from”). In other words, given the following tasks:

- name: Task A
- name: Task B
  runAfter:
    - Task A
  strategy: RunOnFail # Task B only executes if Task A errors out
- name: Task C
  runAfter:
    - Task B
  strategy: AlwaysRun

I expect the following behaviour:

  1. Task A runs
  2. Task B will only run if Task A fails.
  3. Task C will only run if Task B runs.
    • Because Task C declares “AlwaysRun” with “runAfter: Task B”.
    • If Task B never ran (Task A succeeded and B is only RunOnFail) then Task C never runs.
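The expected behaviour listed above can be checked with a small simulation. This is a sketch only: `resolve`, the `outcomes` map and the "not-run" state are hypothetical names, tasks are assumed to be declared after their parents, and RunOnFail is assumed to mean "at least one parent failed" when there are several parents.

```python
from typing import Dict, List


def resolve(tasks: List[dict], outcomes: Dict[str, str]) -> Dict[str, str]:
    """Decide which tasks run, walking them in declaration order.

    outcomes maps a task name to the result it would produce if it runs
    ("success" or "failure"); tasks missing from it default to "success".
    """
    state: Dict[str, str] = {}
    for task in tasks:
        parent_states = [state[p] for p in task.get("runAfter", [])]
        strategy = task.get("strategy", "RunOnSuccess")
        if "not-run" in parent_states:
            runs = False  # a parent never ran, so neither does this task
        elif strategy == "AlwaysRun":
            runs = True  # parents ran; their outcome does not matter
        elif strategy == "RunOnFail":
            runs = "failure" in parent_states
        else:  # RunOnSuccess, the current Tekton behaviour
            runs = all(s == "success" for s in parent_states)
        name = task["name"]
        state[name] = outcomes.get(name, "success") if runs else "not-run"
    return state


tasks = [
    {"name": "Task A"},
    {"name": "Task B", "runAfter": ["Task A"], "strategy": "RunOnFail"},
    {"name": "Task C", "runAfter": ["Task B"], "strategy": "AlwaysRun"},
]
print(resolve(tasks, {"Task A": "success"}))
print(resolve(tasks, {"Task A": "failure"}))
```

In the first call Task A succeeds, so neither Task B nor Task C runs; in the second call Task A fails, so both run.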

So I think that’s another reason why using the go keywords probably doesn’t make sense after all - they don’t map perfectly on to Tekton’s meanings. But AlwaysRun / RunOnFail / RunOnSuccess are a bit clearer maybe, especially when we consider them paired with runAfter.

Another alternative to consider: Go’s defer and recover keywords model quite similar behaviour to what we’re discussing here. I can imagine DeferredPipelineTask and RecoveryPipelineTask types that perform work regardless of prior outcome (Deferred) and in response to a task’s failure (Recovery). Examples:

DeferredPipelineTask

# In this example, a "deferred" task is used to clean up environment after integration tests.
# Deferred tasks run regardless of outcome in prior tasks
spec:
  tasks:
    - name: integration-tests # can fail!
      taskRef:
        name: run-some-tests
    - name: cleanup-integration-environment
      deferred: true # will run regardless of failure in integration-tests. Will not run if integration-tests is never run (i.e. because a task prior to integration-tests failed)
      runAfter: integration-tests
      taskRef:
        name: delete-integration-namespaces

RecoveryPipelineTask

# In this example, a "recovery" task is used to handle errors during deployment to staging.
# Recovery tasks only execute if the task they runAfter fails
spec:
  tasks:
    - name: deploy-to-staging
      taskRef:
        name: deploy-to-k8s
    - name: rollback-staging
      recovery: true # will run only if deploy-to-staging fails
      runAfter: deploy-to-staging
      taskRef:
        name: rollback-deployment

Two further tweaks to this idea: First, a DeferredPipelineTask that doesn’t declare a runAfter will always execute at the end of the pipeline. This is the “finally” clause equivalent. Second, a RecoveryPipelineTask with no runAfter will handle any error case in the pipeline. This is the equivalent of a giant catch { } block wrapped around your pipeline. We could even pass the error to the RecoveryPipelineTask as a PipelineResource or something to help it with reporting.

Also worth keeping in mind that while a DeferredPipelineTask or RecoveryPipelineTask needs to be explicitly marked as such, I think they would also be allowed to be “roots” of their own trees. In other words, another task could be runAfter a DeferredPipelineTask but does not need to include deferred: true. Similarly for recovery, a task could be runAfter a RecoveryPipelineTask but does not need to include recovery: true. In effect this allows entire branches of the execution DAG to be run only in the event of failure or for the purposes of cleanup etc.

So I think this would cover the following scenarios:

  1. Execute work after a specific task in the pipeline succeeds OR fails
    • DeferredPipelineTask with runAfter
    • Use cases: cleanup integration environment, upload unit test results
  2. Recover from failed tasks by jumping to a different branch
    • RecoveryPipelineTask with runAfter
    • Use case: roll back bad deployment
  3. Perform work at the end of a pipeline regardless of outcome
    • DeferredPipelineTask without runAfter
    • Use case: any naive finally scenario (“naive” here means it doesn’t need specific knowledge of what ran or didn’t run)
  4. Handle any error in the pipeline that occurs with a fallback task
    • RecoveryPipelineTask without runAfter
    • Use case: any naive catch { } scenario (example I can think of: send a message to Slack that a pipeline has failed)

The deferred and recovery keys would need to be either-or in the yaml. I don’t think you can support both recovery: true and deferred: true on the same task.
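The four scenarios and the either-or rule could be expressed as a small classifier. This is a hedged sketch: `classify` and the labels it returns are made up for illustration, and the task is assumed to be a plain dictionary parsed from the YAML.

```python
def classify(task: dict) -> str:
    """Map a pipeline task to one of the scenarios listed above.

    Raises ValueError when both deferred and recovery are set, since the
    proposal treats the two keys as mutually exclusive.
    """
    deferred = bool(task.get("deferred"))
    recovery = bool(task.get("recovery"))
    if deferred and recovery:
        raise ValueError(
            f"{task['name']}: deferred and recovery are mutually exclusive")
    has_parent = bool(task.get("runAfter"))
    if deferred:
        # Scenario 1 with runAfter, scenario 3 ("finally") without it.
        return "deferred-after-task" if has_parent else "finally"
    if recovery:
        # Scenario 2 with runAfter, scenario 4 ("catch-all") without it.
        return "recover-task" if has_parent else "catch-all"
    return "regular"


print(classify({"name": "rollback-staging", "recovery": True,
                "runAfter": ["deploy-to-staging"]}))
```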

What I most like about this approach is that:

  1. it doesn’t mess with runAfter, so avoids some possibly tricky schema changes in the yaml (particularly since from behaviour may also need to be modified to keep it in line with runAfter)
  2. it maintains the property that the “edge” in the graph is defined (with runAfter/from) in the same PipelineTask that the error handling or deferral behaviour is described
  3. it provides flexible catch-all handling to satisfy any jump / finally / catch requirements.
  4. It doesn’t rely on tricky-to-remember constants like “IgnorePriorErrors”.
  5. Finally (pun intended) what I like about this is that it drops the word “errorStrategy” completely. I think there are very legitimate use cases for these kinds of handlers that don’t involve errors or failures or anything negative at all. It’s just branching the DAG in response to specific outcomes of the graph nodes.

how about modeling the scenario that @bobcatfish mentioned above with:

    - name: integration
      runAfter: uts
    - name: integration2 # pretend there was another set of tests?
      runAfter: uts
      errorStrategy: SkipOnPriorTaskErrors # do not execute if previous integration tests fail
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter: integration
      errorStrategy: IgnorePriorTaskErrors

woo, I like errorStrategies in runAfter and from, let me give it a thought 🤔…

Yes @pritidesai, I am looking for the feature you mentioned, i.e. ignoring a task failure at pipeline authoring time. Is this feature currently available as an alpha feature?

Hey @email2smohanty this feature is not implemented yet. We are looking for help or if someone is available we can guide on how to implement this. Once implemented, yes it will be an alpha feature.

We have a strict requirement of not stopping the pipeline if any task fails. This cannot be achieved through finally, and we also cannot run the tasks in parallel. Based on this issue and the Tekton documentation, I am assuming that we do not have any configuration or setting at the pipeline level to continue the pipeline execution in case of task failure. So can anyone please suggest how to tackle this issue?

@email2smohanty the current behaviour of a PipelineRun is that as soon as a Task fails, no new TaskRun will be scheduled, and the ones that are currently running will run to completion. Depending on the topology of the PipelineRun, there may be TaskRuns that could have been executed but are not, because we already know that the pipeline will fail.

If I understand correctly, you would like the PipelineRun to continue running as many tasks as the pipeline topology allows, even in case of failure. In case task X fails, any task that depends on X in any way will not be executed, but any other task could still be executed.

There are some features in Tekton today that you could use to achieve something like that - as mentioned in https://github.com/tektoncd/pipeline/issues/1684#issuecomment-794253474 - but they require changes to Tasks and Pipeline.

If you need this feature, would you mind filing a separate issue about it?

Lots of great discussion on the design doc. I’m gonna summarize where we are at now:

runOn

The idea seems popular but instead of a list we might make it into a map.

     - name: task3
       runAfter: ["task1", "task2"]
       runOn:
         - task: task1
           states: ["success", "failure"]
         - task: task2
           states: ["success"]
--- instead of
     - name: task3
       runAfter: ["task1", "task2"]
       runOn: ["success", "failure"]

What’s nice about the map is that it is more powerful, i.e. users can say run task3 regardless of task1’s state but only if task2 succeeds. At the same time, it adds some duplication (we need both runAfter and runOn) and some extra validation on our side (e.g. we should not accept tasks in runOn that are not already present in runAfter). In the future, we can get rid of runAfter in favor of this runOn!
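The extra validation mentioned above might look roughly like this. It is a sketch under assumptions: `validate_run_on` is a hypothetical helper, the task is a dictionary parsed from the YAML, and the set of valid states is taken from the earlier runOn proposal.

```python
from typing import List

VALID_STATES = {"success", "failure", "skip"}


def validate_run_on(task: dict) -> List[str]:
    """Return a list of validation errors; empty means the task is valid."""
    errors = []
    run_after = set(task.get("runAfter", []))
    for entry in task.get("runOn", []):
        # Every task referenced in runOn must also appear in runAfter.
        if entry["task"] not in run_after:
            errors.append(f"runOn task {entry['task']!r} is not in runAfter")
        bad = sorted(set(entry["states"]) - VALID_STATES)
        if bad:
            errors.append(f"runOn task {entry['task']!r} has unknown states {bad}")
    return errors


task3 = {
    "name": "task3",
    "runAfter": ["task1", "task2"],
    "runOn": [
        {"task": "task1", "states": ["success", "failure"]},
        {"task": "task2", "states": ["success"]},
    ],
}
print(validate_run_on(task3))
```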

pipeline level failureStrategy

Instead of adding a pipeline level failureStrategy, we could change the default behavior of pipeline execution from today’s fail on first failure to keep running independent branches of the pipeline until there are no more tests left to run. This would be a backwards incompatible change so we should decide on this sooner rather than later given the upcoming beta release!

cc @sbwsg @skaegi (might be related to https://github.com/tektoncd/pipeline/issues/1978#issuecomment-582941534) Also cc @vdemeester @bobcatfish re: beta release implications

One thing I want to add (though it is a bit tangential) is the idea of Pipeline Failure conditions. Right now, the pipeline bails early if anything fails. If I have multiple “branches” in my pipeline that are independent, I would expect non-dependent branches to run to completion separately from each other. An example:

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: branched-pipeline
spec:
  tasks:
    - name: pre-work
      taskRef:
        name: install-dependencies
    - name: lint
      taskRef:
        name: run-linter
      runAfter:
        - pre-work
    - name: compile
      taskRef:
        name: run-compiler
      runAfter:
        - pre-work
    - name: deploy
      taskRef:
        name: deploy
      runAfter:
        - compile

If the lint task here fails, it will fail the whole pipeline even if the compile succeeds. In this scenario, I would expect/want the deploy to still happen. I could imagine other scenarios where one could want to make a task a show-stopper as well. I believe this calls for failure strategies on a Pipeline
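To make the desired behaviour concrete, here is a sketch that computes which tasks are unaffected by a failure, i.e. tasks that do not depend on the failed task directly or transitively. The function name `unaffected_by` and the dictionary representation of the pipeline are assumptions for illustration; this is not how the Tekton controller is implemented.

```python
from typing import Dict, List, Set


def unaffected_by(pipeline: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """Return tasks that neither failed nor depend on a failed task.

    pipeline maps each task name to the tasks it runs after.
    """
    blocked = set(failed)
    changed = True
    while changed:  # propagate the failure to all transitive dependents
        changed = False
        for name, parents in pipeline.items():
            if name not in blocked and any(p in blocked for p in parents):
                blocked.add(name)
                changed = True
    return set(pipeline) - blocked


# The branched-pipeline example above, as name -> runAfter.
branched = {
    "pre-work": [],
    "lint": ["pre-work"],
    "compile": ["pre-work"],
    "deploy": ["compile"],
}
print(sorted(unaffected_by(branched, {"lint"})))
```

With lint failing, compile and deploy remain eligible because neither depends on lint; with compile failing, deploy is blocked but lint is not.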

I really like the idea of runOn: ["success", "failure", "skip"] == runOn: ["always"]. It keeps with the descriptive nature of Kubernetes and doesn’t require a loaded term (is always really always?).

Hrm. AlwaysRun isn’t that great for the Finally case - it doesn’t make as much sense. Deferred may be better after all. Here’s a comparison:

AlwaysRun

# This pipeline pings a URL when the pipeline finishes.
# This ping happens regardless of the pipeline's outcome.
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: test-pipeline
spec:
  tasks:
    - name: ping-url-on-complete
      taskRef:
        name: send-ping
      strategy: AlwaysRun # AlwaysRun without a runAfter. Executes at end of pipeline.
    - name: uts
      taskRef:
        name: run-unit-tests
    - name: integration
      taskRef:
        name: run-integration-tests
      runAfter: uts

Deferred

# This pipeline pings a URL when the pipeline finishes.
# This ping happens regardless of the pipeline's outcome.
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: test-pipeline
spec:
  tasks:
    - name: ping-url-on-complete
      taskRef:
        name: send-ping
      strategy: Deferred # Deferred without a runAfter. Executes at end of pipeline.
    - name: uts
      taskRef:
        name: run-unit-tests
    - name: integration
      taskRef:
        name: run-integration-tests
      runAfter: uts

Another phrasing of the above approach that @dibyom and I discussed would be to use keywords for defer / recover / skip (the default):

spec:
  tasks:
    - name: deploy-to-staging
      taskRef:
        name: deploy-to-k8s
    - name: rollback-staging
      runAfter: deploy-to-staging
      strategy: Recover # or Defer or Skip
      taskRef:
        name: rollback-deployment

This ^ says that rollback-staging PipelineTask will only execute if deploy-to-staging fails (it “Recovers” from deploy-to-staging’s failure).

Having thought about it for a couple days I’m still pretty sure we could describe all of the use cases we’ve talked about so far with just these three strategies.