pipeline: If the node that runs the taskrun pod shuts down, retry does not work as expected

Expected Behavior

When a node shuts down, the taskrun pod running on that node fails. If retries are set on the taskrun, we would expect the retry pod to start on a healthy node so that the taskrun can continue to work.

Actual Behavior

The taskrun keeps using the failed pod as its working pod and does not create a new pod for the retry (see the sketch after the reproduction steps below).

Steps to Reproduce the Problem

  1. Create a taskrun with retries in a Kubernetes cluster that has multiple nodes.
  2. Shut down the node that is running the pod.
  3. Check the status of the taskrun.
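
A simplified Go sketch of reconciler logic consistent with this behavior (illustrative only, not Tekton's actual code; podForTaskRun and the createPod callback are hypothetical names): because the failed pod still exists in the API after the node shutdown, the lookup by status.podName succeeds and the pod-creation path is never taken.

```go
package reconcile

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// podForTaskRun resolves the pod the reconciler works with. If
// status.podName points at an existing pod, even one that failed because
// its node shut down, that pod is returned and no new pod is created.
func podForTaskRun(ctx context.Context, kc kubernetes.Interface, tr *v1beta1.TaskRun,
	createPod func(context.Context, *v1beta1.TaskRun) (*corev1.Pod, error)) (*corev1.Pod, error) {
	if tr.Status.PodName != "" {
		pod, err := kc.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
		if err == nil {
			return pod, nil // reuses the failed pod from the shut-down node
		}
		if !apierrors.IsNotFound(err) {
			return nil, err
		}
		// Only a pod that no longer exists falls through to creation.
	}
	return createPod(ctx, tr)
}
```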

Additional Info

  • Kubernetes version:

    Output of kubectl version:

```
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:22:09Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:15:26Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
```
  • Tekton Pipeline version: v0.41.0

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (16 by maintainers)

Most upvoted comments

Looking at the history, this check on tr.Status.PodName was deleted at one point, but the change was later reverted.

The removal was introduced in https://github.com/tektoncd/pipeline/commit/0f20c3539f25ede46dfe58b83924e28db1fd783e, which stopped looking up the pod for a taskRun by name and instead looked it up only by label selector. From the commit message:

This adds Reconciler.getPod, which looks up the Pod for a TaskRun by performing a label selector query on Pods, looking for the label we apply to Pods generated by TaskRuns.

If zero Pods are returned, it's the same as .status.podName being "". If multiple Pods are returned, that's an error.
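
A minimal client-go sketch of that lookup, assuming the tekton.dev/taskRun label that Tekton applies to TaskRun-generated pods (the getPod helper shape and error wording here are illustrative, not the commit's exact code):

```go
package reconcile

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getPod looks up the Pod for a TaskRun by label selector rather than by
// name. Zero matches means "no pod yet" (like .status.podName == "");
// more than one match is an error.
func getPod(ctx context.Context, kc kubernetes.Interface, namespace, taskRunName string) (*corev1.Pod, error) {
	pods, err := kc.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("tekton.dev/taskRun=%s", taskRunName),
	})
	if err != nil {
		return nil, err
	}
	switch len(pods.Items) {
	case 0:
		return nil, nil
	case 1:
		return &pods.Items[0], nil
	default:
		return nil, fmt.Errorf("found %d pods for taskrun %q, expected at most one", len(pods.Items), taskRunName)
	}
}
```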

This commit was reverted in https://github.com/tektoncd/pipeline/issues/1944 because, at the time, multiple pods could be associated with a single taskRun object and it was not easy to identify when to declare a taskRun done.

Further reference: https://github.com/tektoncd/pipeline/issues/1689

We can certainly update the pod creation implementation since we have ownerReferences implemented now, but I think that would be a nice-to-have rather than the cause of the issue here.
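
For reference, a hedged sketch of what that owner wiring looks like (withOwner is a hypothetical helper; kmeta.NewControllerRef is the knative.dev/pkg utility for building controller owner references):

```go
package reconcile

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"knative.dev/pkg/kmeta"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// withOwner marks the pod as controller-owned by the TaskRun, so the pod can
// be found (and garbage-collected) via its owner even if pod names change
// between retry attempts.
func withOwner(pod *corev1.Pod, tr *v1beta1.TaskRun) *corev1.Pod {
	pod.OwnerReferences = []metav1.OwnerReference{*kmeta.NewControllerRef(tr)}
	return pod
}
```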

Thank you @yuzp1996 for reporting this issue 🙏

[image: screenshot showing the same pod name across multiple retry attempts]

Why are these pod names the same across multiple attempts? @XinruZhang @lbernick can we please reproduce this 🙏

As per our retry strategies, the pod names should be unique for each retry attempt, if I am not mistaken.
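
A sketch of that expectation (hypothetical helper, not Tekton's actual naming code; kmeta.ChildName is the knative.dev/pkg utility that produces length-safe child names):

```go
package reconcile

import (
	"fmt"

	"knative.dev/pkg/kmeta"
)

// retryPodName derives a distinct pod name per retry attempt, so a retried
// TaskRun never resolves back to the failed pod from an earlier attempt.
func retryPodName(taskRunName string, attempt int) string {
	if attempt == 0 {
		return kmeta.ChildName(taskRunName, "-pod")
	}
	return kmeta.ChildName(taskRunName, fmt.Sprintf("-pod-retry%d", attempt))
}
```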