rancher: Rancher 2.6.5/2.6.6 liveness probe crashes the pod, which restarts infinitely

Rancher Server Setup

  • Rancher version: 2.6.5 and 2.6.6
  • Installation option (Docker install/Helm Chart): Helm chart. Currently chart rancher-2.6.6 (app v2.6.6); previously tried 2.6.5. Installed from the rancher/latest repo.

Information about the Cluster

  • Kubernetes version: v1.20.15+k3s1
  • Cluster Type (Local/Downstream): Local

User Information

  • What is the role of the user logged in? Admin

Describe the bug

Rancher tries to start and fails repeatedly. At first the pod kept exiting with code 137 (killed, NOT OOM). I then changed the livenessProbe on deploy/rancher to:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 80
            scheme: HTTP
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
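For reference, a change like this can be applied non-interactively with a JSON patch. This is only a sketch: it assumes Rancher runs in the cattle-system namespace and that the probe sits on the first container of deploy/rancher.

```shell
# Hypothetical sketch: raise the liveness probe's initialDelaySeconds on the
# rancher deployment. Assumes the cattle-system namespace and that the probe
# is defined on the first container in the pod template.
kubectl -n cattle-system patch deploy rancher --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
   "value": 120}
]'
```

Note that any such edit lives only on the Deployment object and will be reverted by the next helm upgrade.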

The first restart after that change failed with:

2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-kmtbm: context canceled
E0703 10:16:11.471980      32 leaderelection.go:330] error retrieving resource lock kube-system/cattle-controllers: Get "https://10.0.188.63:443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": context canceled
I0703 10:16:11.472029      32 leaderelection.go:283] failed to renew lease kube-system/cattle-controllers: timed out waiting for the condition
E0703 10:16:11.472219      32 leaderelection.go:306] Failed to release lock: resource name may not be empty
I0703 10:16:11.473005      32 trace.go:205] Trace[367625047]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.23.3-rancher2/tools/cache/reflector.go:168 (03-Jul-2022 10:15:15.544) (total time: 55928ms):
Trace[367625047]: [55.928576086s] [55.928576086s] END
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-sqmd2: context canceled
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=RoleTemplate workers
2022/07/03 10:16:11 [INFO] Shutting down provisioning.cattle.io/v1, Kind=Cluster workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=CatalogTemplateVersion workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=RkeAddon workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=NodeDriver workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=PodSecurityPolicyTemplateProjectBinding workers
2022/07/03 10:16:11 [INFO] Shutting down catalog.cattle.io/v1, Kind=ClusterRepo workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=MultiClusterApp workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=AuthConfig workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=ProjectCatalog workers
2022/07/03 10:16:11 [ERROR] failed to call leader func: Get "https://10.0.188.63:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions": context canceled
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=NodePool workers
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-zr5m6: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-szxcd: context canceled
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=RoleBinding workers
2022/07/03 10:16:11 [INFO] Shutting down /v1, Kind=Namespace workers
2022/07/03 10:16:11 [INFO] Shutting down /v1, Kind=Secret workers
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding workers
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=Role workers
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=ClusterRole workers
2022/07/03 10:16:11 [INFO] Shutting down /v1, Kind=ServiceAccount workers
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-g6mz8: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-btxww: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-rr2bb: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-vcvh9: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-rsbzh: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-79bn8: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-kg9ch: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-2tw8h: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-58dwl: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-q8q2q: context canceled
2022/07/03 10:16:11 [FATAL] leaderelection lost for cattle-controllers

After that I extended initialDelaySeconds to 160, and Rancher kept failing with:

2022/07/03 10:20:11 [INFO] Shutting down apiextensions.k8s.io/v1, Kind=CustomResourceDefinition workers
2022/07/03 10:20:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=Role workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=Group workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=CatalogTemplateVersion workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=NodeTemplate workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=MultiClusterAppRevision workers
2022/07/03 10:20:11 [INFO] Shutting down catalog.cattle.io/v1, Kind=App workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=ManagedChart workers
2022/07/03 10:20:11 [INFO] Starting management.cattle.io/v3, Kind=Cluster controller
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=Cluster workers
2022/07/03 10:20:11 [FATAL] failed to wait for caches to sync

Eventually it came back to life.

Obviously this workaround won't last: the next time the deployment is updated, the probe settings will be reset, the liveness check will fail again, and the pod won't start.

To Reproduce

I am really not sure what causes this issue beyond what I have described above.

Additional context

At minimum, the Helm chart could expose an option to tune the liveness/readiness probe settings. That is obviously only a patch, however, and further investigation is required.
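If the chart did expose the probes, an override could be set at upgrade time along these lines. The `--set` keys here are purely illustrative and were not part of the published rancher chart when this issue was filed.

```shell
# Hypothetical: tune probe settings via chart values at upgrade time.
# The livenessProbe.* key names are illustrative only, not a real chart API.
helm upgrade rancher rancher-latest/rancher -n cattle-system \
  --reuse-values \
  --set livenessProbe.initialDelaySeconds=160 \
  --set livenessProbe.periodSeconds=30
```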

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 20 (9 by maintainers)

Most upvoted comments

The PR above and the issues below add probe customization to the Rancher chart, but for now we won't be documenting it on the Rancher docs "Helm chart options" page. This issue will continue to track the long-term fix for the slow startup times.