pipeline: Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up
Hi,
after upgrading Tekton Pipelines to v0.18.0, reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:

[screenshot: tekton_workqueue_depth metric]
The controller log is full of repeated "pod not found" messages like the following:
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:528","msg":"Reconcile error","commit":"8eaaeaa","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:528\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:514\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"taskrun/reconciler.go:294","msg":"Returned an error","commit":"8eaaeaa","knative.dev/traceid":"bd9fd972-9191-44b3-b040-028215d651d2","knative.dev/key":"site/honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgithub.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:513\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
We have a cleanup job running in the cluster that deletes the Pods of finished TaskRuns after some time. Before 0.18.0 this did not seem to be an issue for the controller. A sketch of what such a job does is below.
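For context, here is a minimal sketch of what a cleanup job like ours might look like; it assumes the tekton.dev/taskRun label that Tekton puts on the Pods it creates, and the 24-hour retention window is a placeholder, not taken from our actual job:

```go
// cleanup.go: delete Pods of finished TaskRuns after a retention window.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Tekton labels the Pods it creates with tekton.dev/taskRun.
	pods, err := kc.CoreV1().Pods(metav1.NamespaceAll).List(ctx,
		metav1.ListOptions{LabelSelector: "tekton.dev/taskRun"})
	if err != nil {
		panic(err)
	}

	cutoff := time.Now().Add(-24 * time.Hour) // placeholder retention window
	for _, p := range pods.Items {
		finished := p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed
		if finished && p.CreationTimestamp.Time.Before(cutoff) {
			if err := kc.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
				fmt.Printf("failed to delete pod %s/%s: %v\n", p.Namespace, p.Name, err)
			}
		}
	}
}
```

Once such a job removes the Pod, every subsequent reconcile of the finished TaskRun hits the "pod not found" error shown above.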
Thanks, Fabian
Additional Info
- Kubernetes version: v1.18.9
- Tekton Pipeline version: v0.18.0
Commits related to this issue
- "Consider not-found pod as permanent when taskrun is done": When the taskrun is "done", i.e. it has a completion time set, we try to stop sidecars. If the pod cannot be found, it has most likely been evi... — committed to afrittoli/pipeline by afrittoli 4 years ago
- "Consider not-found pod as permanent when taskrun is done": When the taskrun is "done", i.e. it has a completion time set, we try to stop sidecars. If the pod cannot be found, it has most likely been evi... — committed to tektoncd/pipeline by afrittoli 4 years ago
I believe the issue happens when trying to stop the sidecars. If the pod is not found, we return a non-permanent (!) error: https://github.com/tektoncd/pipeline/blob/473e3f3cc74b215f8c22cb007fcd7413f93f3917/pkg/reconciler/taskrun/taskrun.go#L215-L217
Since at this point the TaskRun is marked as done, I think it is safe to assume that if the pod was not found we can simply ignore it and finish the reconcile. I will open a PR; it should be an easy fix. A sketch of the idea follows.
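Here is a minimal sketch of the idea, not the exact upstream patch; the helper getPodForSidecarStop and its signature are hypothetical, while controller.NewPermanentError is the real knative.dev/pkg API:

```go
package taskrun

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"knative.dev/pkg/controller"
)

// getPodForSidecarStop (hypothetical helper) fetches the TaskRun's pod so
// its sidecars can be stopped. When the TaskRun is already done and the pod
// is gone (e.g. deleted by an external cleanup job), the not-found error is
// wrapped as permanent so the workqueue drops the key instead of requeueing
// it indefinitely.
func getPodForSidecarStop(ctx context.Context, kc kubernetes.Interface, namespace, podName string, taskRunDone bool) (*corev1.Pod, error) {
	pod, err := kc.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if k8serrors.IsNotFound(err) && taskRunDone {
		return nil, controller.NewPermanentError(err)
	}
	return pod, err
}
```

knative.dev/pkg's controller treats an error wrapped with controller.NewPermanentError as non-retryable, so the key is forgotten rather than requeued; that is what stops the workqueue depth from growing.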
Hi @pritidesai, this issue is not related to #3126; sorry for creating confusion by linking it. To reproduce it: