pipeline: Design: Failure Strategy for TaskRuns in a PipelineRun
The goal is to come up with a design to handle failing task runs in a pipelinerun. Today, we simply fail the entire pipelinerun if a single taskrun fails.
Current Status
Summary in this comment: https://github.com/tektoncd/pipeline/issues/1684#issuecomment-611016087
Ideas
Here are a couple of ideas from @sbwsg and me:
- Introduce an `errorStrategy` field in `PipelineTasks`, similar to the idea in #1573
- The `errorStrategy` could be under the `runAfter` field.
- To start off, we could have two error strategies: `FailPipeline`, which is the default behavior today, and `ContinuePipeline`, which will continue running the whole pipeline
- Later on, we could add branch-based error strategies, e.g. fail one branch of the graph but continue running the remaining branches
Additional Info
@sbwsg has some strawperson YAMLs: `RunNextTasks` for an integration test cleanup scenario, `FailPipeline` (default) for a unit test failing before a deploy task
Use Cases
- Unit test fails but integration tests still run
- Rollbacks for CD e.g. Canaries - rollback if canary analysis fails
- Cleanup task if integration test fails
- Always run a step/task at the end e.g. to Report results
- Run on conditional failures #1023
Related Issues
The Epic #1376 has all the related issues
About this issue
- Original URL
- State: open
- Created 5 years ago
- Reactions: 4
- Comments: 56 (33 by maintainers)
One idea - instead of `failure`/`error`/`executionStrategy`, we could have a field like `runOn` (or simply `on` or `when`) that takes in a list of states that the parent taskruns have to be in for it to run (default is: success). What I like about this is that the field name is more succinct, and instead of having to remember a bunch of magic strings (is it `RunOnSuccessOrSkip` or `RunOnSkipAndSuccess`?), users just need to remember the 3 taskrun states, e.g. “success”, “failure”, “skip”. A few more examples here: https://gist.github.com/dibyom/92dfd6ea20f13f5c769a21389df53977
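A sketch of the list form of `runOn` described above (hypothetical syntax; the linked gist has more examples):

```yaml
# Hypothetical runOn syntax; not implemented Tekton API.
tasks:
  - name: cleanup
    taskRef:
      name: cleanup-test-env
    runAfter:
      - integration-tests
    runOn:      # run whether the parent succeeded, failed, or was skipped
      - success
      - failure
      - skip
```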
Status update: We are currently punting on the `runOn` syntax. Instead, we are adding a `finally` field that always runs some tasks at the end of a pipeline (doc). We are also considering adding the following (discussions ongoing in the API working group):
- an `onError`/`except` field that runs a Task if any task in a pipeline fails
- a `finally`/`onError` Step within a Task or within a PipelineTask

Some discussion here
I feel like it’s fair to consider this closed now that we have `finally`, though there are more features to add; to get the complete set of flexibility someone might want, I think we need to add in #2134 as well
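For reference, the `finally` field mentioned above looks roughly like this in a Pipeline spec (a minimal sketch; task names are illustrative):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: test-with-cleanup
spec:
  tasks:
    - name: integration-tests
      taskRef:
        name: run-integration-tests
  finally:          # always runs once all tasks have completed or failed
    - name: cleanup
      taskRef:
        name: cleanup-test-env
```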
@pritidesai is implementing a `finally` field for always running tasks at the end of a pipeline. Hope that helps with some of your use cases. Beyond that, we are considering:

Would this be sufficient for the Kubeflow use cases?
hey folks - this is one of the key requirements for the work we are leading from the Kubeflow side to run on top of Tekton. It would be great to get the current status and see how we can accelerate this
cc @tomcli @afrittoli @skaegi
@pritidesai @bigkevmcd @pierretasci Thanks a lot for adding such detailed use cases! Very very helpful 🙏
@pritidesai For your use case – the current behavior for conditionals is that if a task is skipped, its dependents (identified using the `runAfter` and `from` fields) are automatically skipped. The overall pipelinerun status will be determined from the status of the non-skipped tasks. And the default and only strategy today is `RunOnSuccess`. Though I guess there could be strategies such as `RunOnSuccessOrSkip` or `RunAlways` which could be combined with conditionals for more complex pipelines.

@bigkevmcd Updating status is definitely a very important use case:

- top level `runAfter` - this is the pipeline level `finally` use case. It seems like we’d have to add something like this. The alternative would be to have one task that has `runAfter` set so that it runs after all other tasks, and a `strategy` set to `RunAlways`. This can be unwieldy, since any time you add a new Task to the pipeline, you’d have to manually make sure that the task is still the last thing that executes.
- `errorStrategy` containing a `taskRef` - this is interesting! And in some ways more descriptive than adding a generic task with a `runAfter` and an `errorStrategy: RunOnFailure`. Are there other benefits? One thing I like about keeping the `taskRef`s separate is that we can then have multiple tasks that run/are chained together (e.g. you can have both a `cleanup-test-env` task as well as an `update-github-task` that runs when the test fails).

On passing status to tasks – we had a proposal in https://github.com/tektoncd/pipeline/issues/1020, though the current way of doing so is to pass in the pipelineRun name and then use `kubectl` within the task to fetch the status. (I think @afrittoli might also be doing something here re: the Notifications design work.)

@pierretasci Sounds like the `RunAlways` strategy is what you’d need for the `upload-test-results-step` in your example. I do like the idea of using conditionals as sort of the extension mechanism for more complicated strategies – the basic strategies such as `RunAlways`, `RunOnSuccess`/`Failure`/`Skip` etc. are built-in, while a user can use those plus a conditional to describe complex strategies (e.g. a strategy of `RunAlways` plus a conditional for “if two of the three tasks failed” or whatever).

One specific use case that I don’t think has been explicitly mentioned above is a fan-in/out scenario.
For example, if my pipeline is
Here, I would want to always run the `upload-test-results` task regardless of whether 0, 1, or both of the tasks preceding it failed.

To me, this reads a lot like conditional execution, but more like conditional failure. Perhaps this could be served as an extension to the conditions that already exist. If you wanted to “always execute B after A”, your condition could simply always return true to override the default behavior of “execute B after A if A is successful”.
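A sketch of the fan-in scenario above (the `strategy` field is hypothetical, not implemented Tekton API; task names are illustrative):

```yaml
# Hypothetical syntax: strategy is not an implemented Tekton field.
tasks:
  - name: unit-tests
    taskRef:
      name: run-unit-tests
  - name: integration-tests
    taskRef:
      name: run-integration-tests
  - name: upload-test-results
    taskRef:
      name: upload-results
    runAfter:
      - unit-tests
      - integration-tests
    strategy: RunAlways   # run whether 0, 1, or both parents failed
```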
Another slightly different case:
For things like updating GitHub status notifications it would be nice if we could do something like the following…admittedly this is a bit repetitive, but passing the “success” or “failure” of a task might work with the “recover” strategy mentioned earlier, which would mean that after each task, somehow it’d use the success/failure of the previous task to update the GitHub status appropriately.
Updating these kinds of statuses would be really useful if you want your pipeline to determine whether or not a commit can be merged (if you’re not familiar with these, you can require specific contexts to be successful before a PR can be merged).
This also adds a pipeline-scoped `runAfter` `taskRef`, which could do the cleanup in a “Go defer” way, i.e. always after the pipeline has ended, irrespective of what caused it to end.

The example below would trigger two parallel executions (lint and tests), which would report their status to GitHub.
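A rough sketch of what that pipeline-scoped “defer”-style `taskRef` could look like (hypothetical syntax; task names are illustrative):

```yaml
# Hypothetical pipeline-level runAfter; not implemented Tekton API.
spec:
  tasks:
    - name: lint
      taskRef:
        name: run-lint
    - name: tests
      taskRef:
        name: run-tests
  runAfter:     # hypothetical pipeline-scoped field: always runs last
    taskRef:
      name: update-github-status
```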
I agree, the keywords don’t make much sense in isolation. How about “AlwaysRun” (defer), “RunOnFail” (recover), and “RunOnSuccess” (Tekton’s current behaviour)?
I think the analogy here with Go’s defer breaks down. I somewhat regret drawing the comparison. In my mind the strategy only describes a single relationship between a task and its “parents” (those it declares with `runAfter` or `from`). In other words, given the following tasks:
I expect the following behaviour:
So I think that’s another reason why using the Go keywords probably doesn’t make sense after all - they don’t map perfectly onto Tekton’s meanings. But AlwaysRun / RunOnFail / RunOnSuccess are a bit clearer maybe, especially when we consider them paired with `runAfter`.

Another alternative to consider: Go’s defer and recover keywords model quite similar behaviour to what we’re discussing here. I can imagine `DeferredPipelineTask` and `RecoveryPipelineTask` types that perform work regardless of prior outcome (Deferred) and in response to a task’s failure (Recovery). Examples:

DeferredPipelineTask
RecoveryPipelineTask
Two further tweaks to this idea: First, a `DeferredPipelineTask` that doesn’t declare a `runAfter` will always execute at the end of the pipeline. This is the “finally” clause equivalent. Second, a `RecoveryPipelineTask` with no `runAfter` will handle any error case in the pipeline. This is the equivalent of a giant `catch { }` block wrapped around your pipeline. We could even pass the error to the `RecoveryPipelineTask` as a PipelineResource or something to help it with reporting.

Also worth keeping in mind that while a DeferredPipelineTask or RecoveryPipelineTask needs to be explicitly marked as such, I think they would also be allowed to be “roots” of their own trees. In other words, another task could be `runAfter` a DeferredPipelineTask but would not need to include `deferred: true`. Similarly for recovery, a task could be `runAfter` a RecoveryPipelineTask but would not need to include `recovery: true`. In effect this allows entire branches of the execution DAG to be run only in the event of failure, or for the purposes of cleanup etc.

So I think this would cover the following scenarios:
- `DeferredPipelineTask` with runAfter
- `RecoveryPipelineTask` with runAfter
- `DeferredPipelineTask` without runAfter - the naive `finally` scenario (“naive” here means it doesn’t need specific knowledge of what ran or didn’t run)
- `RecoveryPipelineTask` without runAfter - the naive `catch { }` scenario (example I can think of: send a message to Slack that a pipeline has failed)

The deferred and recovery keys would need to be either-or in the yaml. I don’t think you can support both `recovery: true` and `deferred: true` on the same task.
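A sketch of the two markers as either-or fields on a PipelineTask (the `deferred`/`recovery` keys are hypothetical, per the proposal above, not implemented Tekton API):

```yaml
# Hypothetical syntax; deferred/recovery are not implemented Tekton fields.
tasks:
  - name: deploy
    taskRef:
      name: deploy-app
  - name: cleanup
    taskRef:
      name: cleanup-test-env
    runAfter:
      - deploy
    deferred: true      # runs after its parents regardless of their outcome
  - name: notify-failure
    taskRef:
      name: send-slack-message
    recovery: true      # no runAfter: handles any failure in the pipeline
```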
What I most like about this approach is that:
- it builds on `runAfter`, so it avoids some possibly tricky schema changes in the yaml (particularly since `from` behaviour may also need to be modified to keep it in line with runAfter)
- the relationship (`runAfter`/`from`) is declared in the same PipelineTask that the error handling or deferral behaviour is described in

how about modeling the scenario that @bobcatfish mentioned above with:
woo, I like errorStrategies in `runAfter` and `from`, let me give it a thought 🤔

Hey @email2smohanty this feature is not implemented yet. We are looking for help, or if someone is available we can guide them on how to implement this. Once implemented, yes, it will be an alpha feature.
@email2smohanty the current behaviour of `PipelineRun` is that as soon as a `Task` fails, no new `TaskRun` will be scheduled, and the ones that are currently running will run to completion. Depending on the topology of the `PipelineRun`, there may be `TaskRun`s that could have been executed, but were not, because we already know that the pipeline would fail.

If I understand correctly, you would like the `PipelineRun` to continue running as many tasks as the pipeline topology allows, even in case of failure. In case task X fails, any task that depends on X in any way will not be executed, but any other task could still be executed.

There are some features in Tekton today that you could use to achieve something like that - as mentioned in https://github.com/tektoncd/pipeline/issues/1684#issuecomment-794253474 - but they require changes to Tasks and Pipelines.

If you need this feature, would you mind filing a separate issue about it?
Lots of great discussion on the design doc. I’m gonna summarize where we are at now:

`runOn`

The idea seems popular, but instead of a list we might make it into a map. What’s nice about the map is that it is more powerful, i.e. users can say “run this task3 regardless of task1’s state, but only if task2 succeeds”. At the same time, it adds some duplication (we need both runAfter and runOn) and some extra validation on our side (e.g. we should not accept tasks in runOn that are not already present in runAfter). In the future, we could get rid of `runAfter` in favor of this `runOn`!

pipeline level `failureStrategy`

Instead of adding a pipeline level failureStrategy, we could change the default behavior of pipeline execution from today’s fail-on-first-failure to keep running independent branches of the pipeline until there are no more tasks left to run. This would be a backwards incompatible change, so we should decide on this sooner rather than later given the upcoming beta release!
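The map form of `runOn` mentioned in the summary might look like this (hypothetical syntax; not implemented Tekton API):

```yaml
# Hypothetical map form of runOn; not implemented Tekton API.
tasks:
  - name: task3
    taskRef:
      name: report
    runAfter:
      - task1
      - task2
    runOn:
      task1: [success, failure, skip]  # run regardless of task1's state
      task2: [success]                 # ...but only if task2 succeeds
```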
cc @sbwsg @skaegi (might be related to https://github.com/tektoncd/pipeline/issues/1978#issuecomment-582941534) Also cc @vdemeester @bobcatfish re: beta release implications
One thing I want to add (though it is a bit tangential) is the idea of Pipeline Failure conditions. Right now, the pipeline bails early if anything fails. If I have multiple “branches” in my pipeline that are independent, I would expect non-dependent branches to run to completion separately from each other. An example:
If the `lint` task here fails, it will fail the whole pipeline even if the compile succeeds. In this scenario, I would expect/want the `deploy` to still happen. I could imagine other scenarios where one could want to make a task a show-stopper as well. I believe this calls for failure strategies on a Pipeline.

I really like the idea of `runOn: ["success", "failure", "skip"]` == `runOn: ["always"]`. Keeps with the descriptive nature of Kubernetes and doesn’t require a loaded term (is “always” really always?).

Hrm. `AlwaysRun` isn’t that great for the Finally case - it doesn’t make as much sense. Deferred may be better after all. Here’s a comparison:

AlwaysRun

Deferred
Another phrasing of the above approach that @dibyom and I discussed would be to use keywords for defer / recover / skip (the default):
This ^ says that the `rollback-staging` PipelineTask will only execute if `deploy-to-staging` fails (it “Recovers” from deploy-to-staging’s failure).

Having thought about it for a couple of days, I’m still pretty sure we could describe all of the use cases we’ve talked about so far with just these three strategies.
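A sketch of the `recover` keyword phrasing described above (hypothetical syntax; not implemented Tekton API):

```yaml
# Hypothetical syntax; "recover" is not an implemented Tekton field.
tasks:
  - name: deploy-to-staging
    taskRef:
      name: deploy
  - name: rollback-staging
    taskRef:
      name: rollback
    recover: deploy-to-staging   # runs only if deploy-to-staging fails
```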