pipeline: Slow/stuck reconciliation after 0.18.0 upgrade when completed Pods are cleaned up

Hi,

After upgrading Tekton Pipelines to v0.18.0, reconciliation seems to be stuck, or at least really slow. Here is a screenshot of the tekton_workqueue_depth metric:

[screenshot: tekton_workqueue_depth]

The controller log is full of repeated “pod not found” messages like the following.

{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"controller/controller.go:528","msg":"Reconcile error","commit":"8eaaeaa","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).handleErr\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:528\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:514\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}
{"level":"error","ts":"2020-11-12T15:03:10.116Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-taskrun.Reconciler","caller":"taskrun/reconciler.go:294","msg":"Returned an error","commit":"8eaaeaa","knative.dev/traceid":"bd9fd972-9191-44b3-b040-028215d651d2","knative.dev/key":"site/honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6","targetMethod":"ReconcileKind","targetMethod":"ReconcileKind","error":"pods \"honeycomb-beekeeper-ci-dfj8f-kustomize-tltx6-pod-qfjn5\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgithub.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:294\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:513\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller.(*Impl).RunContext.func3\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:451"}

We do have a cleanup job running in the cluster that deletes the Pods of finished TaskRuns after some time. Before 0.18.0 this did not seem to be an issue for the controller.
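For context, the cleanup is roughly equivalent to the client-go sketch below; it is not our exact job, and the namespace ("site", taken from the log key above), the tekton.dev/taskRun label selector, and the status.phase field selector are assumptions about our setup. The point is that it deletes only the Pods of finished TaskRuns and leaves the TaskRun/PipelineRun objects in place.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()

	// List Pods that belong to TaskRuns (assumption: Tekton labels them with
	// tekton.dev/taskRun) and that have already finished running.
	pods, err := client.CoreV1().Pods("site").List(ctx, metav1.ListOptions{
		LabelSelector: "tekton.dev/taskRun",
		FieldSelector: "status.phase=Succeeded",
	})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		// Delete only the Pod; the owning TaskRun/PipelineRun objects stay,
		// which is exactly the state that produces the reconcile errors above.
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Printf("failed to delete %s/%s: %v\n", p.Namespace, p.Name, err)
		}
	}
}
```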

Thanks, Fabian

Additional Info

  • Kubernetes version: v1.18.9
  • Tekton Pipeline version: v0.18.0

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

I believe the issue happens when trying to stop sidecars. If the pod is not found, we return a non-permanent (!) error: https://github.com/tektoncd/pipeline/blob/473e3f3cc74b215f8c22cb007fcd7413f93f3917/pkg/reconciler/taskrun/taskrun.go#L215-L217

Since at this point the TaskRun is already marked as done, I think it is safe to assume that if the pod was not found we can just ignore the error and finish the reconcile. I will open a PR; it should be an easy fix.
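As a rough sketch of what I mean (the helper name below is made up for illustration; the real change would live in the sidecar-stopping path linked above): once the TaskRun is done, a missing Pod just means something already cleaned it up, so there are no sidecars left to stop and the reconciler can return nil instead of re-queueing the key with backoff.

```go
package taskrun

import (
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
)

// ignorePodNotFoundWhenDone turns a "pod not found" error into a successful
// no-op for TaskRuns that have already reached a terminal state, so the
// workqueue does not keep retrying keys whose Pods were deleted externally.
func ignorePodNotFoundWhenDone(err error, taskRunDone bool) error {
	if err != nil && taskRunDone && k8serrors.IsNotFound(err) {
		return nil
	}
	return err
}
```

If we still want the condition surfaced, wrapping it as a permanent error (knative's controller package has a helper for that) would also stop the retries; either way the workqueue stops filling up with keys that can never succeed.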

Hi @pritidesai, this issue is not related to 3126, sorry for creating confusion by linking it. To reproduce it:

  • create a cluster and create some PipelineRuns in it. Any pipeline should work.
  • wait for all PipelineRuns to complete
  • delete the pods of the executed tasks (but keep the PipelineRuns and TaskRuns)
  • restart the tekton pipelines controller
  • check the tekton_workqueue_depth metric; it will take a long time to reach 0 (much longer than with versions before 0.18.0). A small scraper like the sketch below can be used to watch it.
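For the last step, a throwaway scraper like this one works without a full Prometheus setup. The port and path are assumptions: it expects the controller's metrics endpoint to be port-forwarded to localhost:9090, so adjust them to whatever your config-observability setup exposes.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://localhost:9090/metrics")
		if err != nil {
			fmt.Println("scrape failed:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			for scanner.Scan() {
				// Print only the workqueue depth samples we care about.
				if line := scanner.Text(); strings.HasPrefix(line, "tekton_workqueue_depth") {
					fmt.Println(line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(10 * time.Second)
	}
}
```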