pipeline: Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up
Hi,
after upgrading Tekton Pipelines to v0.18.0, reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:

[screenshot: tekton_workqueue_depth metric]
The controller log is full of repeated "pod not found" messages like the following:
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:528","msg":"Reconcile error","commit":"8eaaeaa","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:528\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:514\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"taskrun/reconciler.go:294","msg":"Returned an error","commit":"8eaaeaa","knative.dev/traceid":"bd9fd972-9191-44b3-b040-028215d651d2","knative.dev/key":"site/honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgithub.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:513\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
We have a cleanup job running in the cluster that deletes the Pods of finished TaskRuns after some time. Before 0.18.0 this did not seem to be an issue for the controller. A sketch of what such a job does is below.
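For context, here is a minimal sketch of what a cleanup job like ours might look like; it assumes the tekton.dev/taskRun label that Tekton puts on the Pods it creates, and the 24-hour retention window is a placeholder, not taken from our actual job:

```go
// cleanup.go: delete Pods of finished TaskRuns after a retention window.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Tekton labels the Pods it creates with tekton.dev/taskRun.
	pods, err := kc.CoreV1().Pods(metav1.NamespaceAll).List(ctx,
		metav1.ListOptions{LabelSelector: "tekton.dev/taskRun"})
	if err != nil {
		panic(err)
	}

	cutoff := time.Now().Add(-24 * time.Hour) // placeholder retention window
	for _, p := range pods.Items {
		finished := p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed
		if finished && p.CreationTimestamp.Time.Before(cutoff) {
			if err := kc.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
				fmt.Printf("failed to delete pod %s/%s: %v\n", p.Namespace, p.Name, err)
			}
		}
	}
}
```

Once such a job removes the Pod, every subsequent reconcile of the finished TaskRun hits the "pod not found" error shown above.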
Thanks, Fabian
Additional Info
- Kubernetes version: v1.18.9
- Tekton Pipeline version: v0.18.0
Commits related to this issue
- "Consider not-found pod as permanent when taskrun is done": When the taskrun is "done", i.e. it has a completion time set, we try to stop sidecars. If the pod cannot be found, it has most likely been evi... — committed to afrittoli/pipeline by afrittoli 4 years ago
- "Consider not-found pod as permanent when taskrun is done": When the taskrun is "done", i.e. it has a completion time set, we try to stop sidecars. If the pod cannot be found, it has most likely been evi... — committed to tektoncd/pipeline by afrittoli 4 years ago
I believe the issue happens when trying to stop the sidecars. If the pod is not found, we return a non-permanent (!) error: https://github.com/tektoncd/pipeline/blob/473e3f3cc74b215f8c22cb007fcd7413f93f3917/pkg/reconciler/taskrun/taskrun.go#L215-L217
Since at this point the TaskRun is marked as done, I think it is safe to assume that if the pod was not found we can simply ignore it and finish the reconcile. I will open a PR; it should be an easy fix. A sketch of the idea follows.
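Here is a minimal sketch of the idea, not the exact upstream patch; the helper getPodForSidecarStop and its signature are hypothetical, while controller.NewPermanentError is the real knative.dev/pkg API:

```go
package taskrun

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"knative.dev/pkg/controller"
)

// getPodForSidecarStop (hypothetical helper) fetches the TaskRun's pod so
// its sidecars can be stopped. When the TaskRun is already done and the pod
// is gone (e.g. deleted by an external cleanup job), the not-found error is
// wrapped as permanent so the workqueue drops the key instead of requeueing
// it indefinitely.
func getPodForSidecarStop(ctx context.Context, kc kubernetes.Interface, namespace, podName string, taskRunDone bool) (*corev1.Pod, error) {
	pod, err := kc.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if k8serrors.IsNotFound(err) && taskRunDone {
		return nil, controller.NewPermanentError(err)
	}
	return pod, err
}
```

knative.dev/pkg's controller treats an error wrapped with controller.NewPermanentError as non-retryable, so the key is forgotten rather than requeued; that is what stops the workqueue depth from growing.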
Hi @pritidesai, this issue is not related to #3126; sorry for creating confusion by linking it. To reproduce it: