pipeline: If the node running the TaskRun pod shuts down, retry does not work as expected
Expected Behavior
When a node shuts down, the TaskRun pod running on that node fails. If retries are set on the TaskRun, we would expect the retried pod to start on a healthy node so that the TaskRun can continue to work.
Actual Behavior
The TaskRun keeps using the failed pod as its working pod and does not create a new pod for the retry.
Steps to Reproduce the Problem
- Create a TaskRun with retries in a k8s cluster that has multiple nodes
- Shut down the node that is running the pod
- Check the status of the TaskRun
Additional Info
- Kubernetes version:
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:22:09Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:15:26Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
- Tekton Pipeline version:
v0.41.0
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (16 by maintainers)
Commits related to this issue
- Retry using different pod when node shutdown. When node shutdown then the retry pod will always use the same pod because it can not recongize that the pod can not work anymore and k8s can not delete ... — committed to yuzp1996/pipeline by yuzp1996 a year ago
- Retry using a different pod when the node shutdown. When the node shutdown then the retry pod will always be the same pod because it can not recognise that the pod can not work anymore and k8s can no... — committed to yuzp1996/pipeline by yuzp1996 a year ago
Looking at the history, this check on tr.Status.PodName was deleted, but the deletion was then reverted.
The change was introduced in https://github.com/tektoncd/pipeline/commit/0f20c3539f25ede46dfe58b83924e28db1fd783e to stop looking up the pod for a taskRun by name and instead look it up only by labelSelector.
That commit was reverted in https://github.com/tektoncd/pipeline/issues/1944 because back then we had multiple pods associated with a single taskRun object and it was not easy to identify when to declare done for a taskRun.
Further reference: https://github.com/tektoncd/pipeline/issues/1689
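The difference between the two lookup approaches discussed above can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the Tekton implementation: the Pod type, the findByName and findByLabel helpers, and the pod names are all hypothetical, and the tekton.dev/taskRun label key is assumed here for the example. It shows why a label-based lookup can return several pods for one taskRun, which is what made it hard to decide when the taskRun was done.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for a Kubernetes pod.
type Pod struct {
	Name   string
	Labels map[string]string
}

// findByName returns the single pod whose name matches exactly,
// mirroring a lookup keyed on a recorded pod name.
func findByName(pods []Pod, name string) (Pod, bool) {
	for _, p := range pods {
		if p.Name == name {
			return p, true
		}
	}
	return Pod{}, false
}

// findByLabel returns every pod carrying the given label value,
// mirroring a label-selector lookup. It may return more than one pod.
func findByLabel(pods []Pod, key, value string) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.Labels[key] == value {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Assumed label key and pod names, for illustration only.
	const key = "tekton.dev/taskRun"
	pods := []Pod{
		{Name: "build-pod", Labels: map[string]string{key: "build"}},
		{Name: "build-pod-retry1", Labels: map[string]string{key: "build"}},
		{Name: "other-pod", Labels: map[string]string{key: "other"}},
	}

	if p, ok := findByName(pods, "build-pod"); ok {
		fmt.Println("by name:", p.Name)
	}
	fmt.Println("by label:", len(findByLabel(pods, key, "build")), "pods")
}
```

With a name-based lookup the controller always lands on the one recorded pod, even if that pod sits on a node that is gone. With a label-based lookup it sees every pod for the taskRun and needs an extra rule to pick the current one.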
We can certainly update the pod creation implementation since we have ownerReference implemented now, but I think that would be a nice-to-have and it is not causing any issues here.
Thank you @yuzp1996 for reporting this issue.
Why are these pod names the same for multiple attempts? @XinruZhang @lbernick can we please reproduce this?
As per our retry strategies, the pod names must be unique for each retry attempt, if I am not mistaken.
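To make the expectation concrete, here is a minimal Go sketch of one way a pod name could be made unique per retry attempt. The helper podNameForAttempt, its parameters, and the "-pod" / "-retryN" suffix scheme are assumptions chosen for illustration; the thread does not confirm that this is how the controller names its pods.

```go
package main

import "fmt"

// podNameForAttempt is a hypothetical helper that derives a pod name from
// the taskRun name and the number of attempts that have already failed.
// The first attempt gets "<taskRun>-pod"; each retry appends "-retry<N>",
// so no two attempts share a name.
func podNameForAttempt(taskRunName string, previousAttempts int) string {
	name := taskRunName + "-pod"
	if previousAttempts > 0 {
		name = fmt.Sprintf("%s-retry%d", name, previousAttempts)
	}
	return name
}

func main() {
	// Print the name that each of three consecutive attempts would use.
	for attempt := 0; attempt < 3; attempt++ {
		fmt.Println(podNameForAttempt("build", attempt))
	}
}
```

If names differ per attempt, a retry cannot pick up the pod left behind on the node that shut down, because a lookup by the new name finds nothing and a fresh pod has to be created.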