pipeline: If the node that runs the taskrun pod shuts down, retry does not work as expected

Expected Behavior

When a node shuts down, the taskrun pod running on that node fails. If retries are set on the taskrun, we would expect the retry pod to start on a healthy node so that the taskrun can continue to work.

Actual Behavior

The taskrun keeps using the failed pod as its working pod and does not create a new pod for the retry (see the sketch after the reproduction steps below).

Steps to Reproduce the Problem

  1. Create a taskrun with retries in a Kubernetes cluster that has multiple nodes.
  2. Shut down the node that is running the pod.
  3. Check the status of the taskrun.
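
A simplified Go sketch of reconciler logic consistent with this behavior (illustrative only, not Tekton's actual code; podForTaskRun and the createPod callback are hypothetical names): because the failed pod still exists in the API after the node shutdown, the lookup by status.podName succeeds and the pod-creation path is never taken.

```go
package reconcile

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// podForTaskRun resolves the pod the reconciler works with. If
// status.podName points at an existing pod, even one that failed because
// its node shut down, that pod is returned and no new pod is created.
func podForTaskRun(ctx context.Context, kc kubernetes.Interface, tr *v1beta1.TaskRun,
	createPod func(context.Context, *v1beta1.TaskRun) (*corev1.Pod, error)) (*corev1.Pod, error) {
	if tr.Status.PodName != "" {
		pod, err := kc.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
		if err == nil {
			return pod, nil // reuses the failed pod from the shut-down node
		}
		if !apierrors.IsNotFound(err) {
			return nil, err
		}
		// Only a pod that no longer exists falls through to creation.
	}
	return createPod(ctx, tr)
}
```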

Additional Info

  • Kubernetes version:

    Output of kubectl version:

```
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:22:09Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:15:26Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
```
  • Tekton Pipeline version: v0.41.0

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (16 by maintainers)

Most upvoted comments

Looking at the history, this check on tr.Status.PodName was deleted at one point, but the change was later reverted.

The removal was introduced in https://github.com/tektoncd/pipeline/commit/0f20c3539f25ede46dfe58b83924e28db1fd783e, which stopped looking up the pod for a taskRun by name and instead looked it up only by label selector. From the commit message:

This adds Reconciler.getPod, which looks up the Pod for a TaskRun by performing a label selector query on Pods, looking for the label we apply to Pods generated by TaskRuns.

If zero Pods are returned, it's the same as .status.podName being "". If multiple Pods are returned, that's an error.
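
A minimal client-go sketch of that lookup, assuming the tekton.dev/taskRun label that Tekton applies to TaskRun-generated pods (the getPod helper shape and error wording here are illustrative, not the commit's exact code):

```go
package reconcile

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getPod looks up the Pod for a TaskRun by label selector rather than by
// name. Zero matches means "no pod yet" (like .status.podName == "");
// more than one match is an error.
func getPod(ctx context.Context, kc kubernetes.Interface, namespace, taskRunName string) (*corev1.Pod, error) {
	pods, err := kc.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("tekton.dev/taskRun=%s", taskRunName),
	})
	if err != nil {
		return nil, err
	}
	switch len(pods.Items) {
	case 0:
		return nil, nil
	case 1:
		return &pods.Items[0], nil
	default:
		return nil, fmt.Errorf("found %d pods for taskrun %q, expected at most one", len(pods.Items), taskRunName)
	}
}
```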

This commit was reverted in https://github.com/tektoncd/pipeline/issues/1944 because, at the time, multiple pods could be associated with a single taskRun object and it was not easy to identify when to declare a taskRun done.

Further reference: https://github.com/tektoncd/pipeline/issues/1689

We can certainly update the pod creation implementation since we have ownerReferences implemented now, but I think that would be a nice-to-have rather than the cause of the issue here.
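
For reference, a hedged sketch of what that owner wiring looks like (withOwner is a hypothetical helper; kmeta.NewControllerRef is the knative.dev/pkg utility for building controller owner references):

```go
package reconcile

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"knative.dev/pkg/kmeta"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// withOwner marks the pod as controller-owned by the TaskRun, so the pod can
// be found (and garbage-collected) via its owner even if pod names change
// between retry attempts.
func withOwner(pod *corev1.Pod, tr *v1beta1.TaskRun) *corev1.Pod {
	pod.OwnerReferences = []metav1.OwnerReference{*kmeta.NewControllerRef(tr)}
	return pod
}
```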

Thank you @yuzp1996 for reporting this issue 🙏

[image: screenshot showing the same pod name across multiple retry attempts]

Why are these pod names the same across multiple attempts? @XinruZhang @lbernick can we please reproduce this 🙏

As per our retry strategies, the pod names should be unique for each retry attempt, if I am not mistaken.
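
A sketch of that expectation (hypothetical helper, not Tekton's actual naming code; kmeta.ChildName is the knative.dev/pkg utility that produces length-safe child names):

```go
package reconcile

import (
	"fmt"

	"knative.dev/pkg/kmeta"
)

// retryPodName derives a distinct pod name per retry attempt, so a retried
// TaskRun never resolves back to the failed pod from an earlier attempt.
func retryPodName(taskRunName string, attempt int) string {
	if attempt == 0 {
		return kmeta.ChildName(taskRunName, "-pod")
	}
	return kmeta.ChildName(taskRunName, fmt.Sprintf("-pod-retry%d", attempt))
}
```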