rancher: Clusters become unreachable via UI and kubectl after some time

Rancher Server Setup

  • Rancher version: 2.6.4
  • Installation option (Docker install/Helm Chart): Helm, RKE1, K8S 1.21.10

Information about the Cluster

  • Kubernetes version: K8S 1.20.9
  • Cluster Type (Local/Downstream): Downstream; the remote cluster was provisioned using the Docker command. We are able to reproduce this on two different clusters.

User Information

  • What is the role of the user logged in? Admin

Describe the bug Since the upgrade from 2.5.8 to 2.6.4 we have noticed that our clusters become unresponsive via the UI and via kubectl. Weirdly enough, if we log in via the old UI (removing /dashboard from the URL and adding /login), we can still see everything without issue…

To Reproduce The fastest way we found to reproduce this is to run the following command against the cluster a few times in a row (10-15 times, not in parallel). The cluster is quite busy, so we get a large amount of logs doing this: kubectl logs -n ingress-nginx -f <nginx-ingress-pod> or kubectl logs -n ingress-nginx <nginx-ingress-pod>
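A rough sketch of the kind of loop we use for this (the pod name is a placeholder, the count is arbitrary, and the kubeconfig must be one that goes through the Rancher auth proxy):

# Hypothetical repro loop: fetch logs from a busy ingress pod 10-15 times in a row,
# not in parallel, through a Rancher-generated (auth proxy) kubeconfig.
POD="<nginx-ingress-pod>"   # placeholder: substitute a real pod name
for i in $(seq 1 15); do
  kubectl logs -n ingress-nginx "$POD" > /dev/null
done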

Result After a few attempts the command hangs, and from then on any kubectl command hangs (either every time, or at first roughly once every two attempts). The cluster is still displayed as connected and healthy in the UI.

Expected Result The command shouldn’t hang, making interaction with the cluster impossible. The situation never recovers on its own; we are forced to kill all the Rancher pods to restore the UI and kubectl.

Additional context We are unsure whether kubectl logs itself is the cause or whether it simply pushes enough data to block something in Rancher… We are wondering whether we are in a case similar to the one below, where the routing that handles forwarding to the cluster is blocked by a full buffer or something similar. But as that ticket is closed, we aren’t sure it is the same issue. https://github.com/rancher/rancher/issues/34819#issuecomment-1007600901

We also see a lot of the messages mentioned in this one: https://github.com/rancher/rancher/issues/36584, but we aren’t sure whether they are the cause of our outages.

Also note that restarting the agent on the downstream cluster, or forcing a full connection interruption between the agents and the Rancher cluster to make them reconnect, doesn’t fix the issue either; only a full kill of the Rancher pods fixes the issue for us at the moment.
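For reference, a sketch of the only workaround that currently restores things for us (assuming the default Helm install of Rancher into the cattle-system namespace):

# Workaround sketch: bounce every Rancher server pod on the local (Rancher) cluster.
# Assumes the default Helm install into cattle-system; the deployment name may differ.
kubectl -n cattle-system rollout restart deployment rancher
kubectl -n cattle-system rollout status deployment rancher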

SURE-4484

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 36
  • Comments: 61 (4 by maintainers)

Most upvoted comments

We’re facing the same problem.

I am running Rancher 2.6.5 on Kubernetes v1.22.9-eks-a64ea69, using nginx-ingress.

I have found a way to reliably reproduce the issue on my server. First I created a pod on one of my clusters that generates a large amount of logs:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: snowcrash
  name: snowcrash
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snowcrash
  serviceName: snowcrash
  template:
    metadata:
      labels:
        app: snowcrash
    spec:
      containers:
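        # single container whose only purpose is to emit a continuous flood of log output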
        - command:
            - sh
            - -c
            - while true ; do sleep 1 ; dd if=/dev/zero bs=8M count=32 | xxd -c 1048576 ; done
          image: alpine
          name: snowcrash
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "100m"

Next, using a kubeconfig that utilizes Rancher’s auth proxy, I leave the following command running:

kubectl logs -f snowcrash-0

A single instance of kubectl logs would eventually (after roughly 5 minutes?) make that cluster unreachable through the auth proxy. Running three of those in parallel would kill it in about 1-3 minutes. Restarting Rancher fixes everything. If I connect directly to the cluster instead of going through the Rancher server’s proxy, everything works fine.
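A sketch of the parallel variant I used (assumes KUBECONFIG points at the Rancher-generated kubeconfig for that cluster):

# Run three log followers in parallel through the Rancher auth proxy.
# In my environment this made the cluster unreachable through the proxy within 1-3 minutes.
for i in 1 2 3; do
  kubectl logs -f snowcrash-0 > /dev/null &
done
wait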

I tried kubectl attach through the proxy, and could run several instances of that in parallel without experiencing any issues. There must be something different about kubectl logs that breaks things.

I can reproduce this on Rancher 2.6.4: I can’t connect to imported clusters… (especially one of them, which is in a datacenter and much bigger than the others). My Rancher setup is a 2-node cluster on EKS 1.20.

Some facts (for this imported cluster, which is on Kubernetes 1.21 and built with kubeadm):

  • The cluster itself works fine (and admin credentials work fine)
  • The Rancher cluster main page doesn’t load (c/c-rcjg6/explorer)
  • But if you click anywhere else (nodes/namespaces/…, c/c-rcjg6/explorer/projectsnamespaces), it works
  • I can also open the “Kubectl Shell” from the Rancher web console and it works perfectly
  • However, accessing the cluster with Rancher kubeconfig credentials from the internet doesn’t work
  • It also doesn’t work when connecting with curl (which makes sense, as I’m recreating the same requests kubectl would make - I saw them with --v=9)
  • Querying svc/rancher through kubectl port-forwarding + curl also doesn’t work (see the sketch after this list)
  • Restarting the Rancher server pods makes things work again for a short period of time
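Roughly what the port-forward + curl test looked like (a sketch; the bearer token is a placeholder and c-rcjg6 is the cluster ID from the list above):

# Bypass the ALB and nginx ingress by port-forwarding straight to the rancher service,
# then hit the proxied API for the downstream cluster. <RANCHER-TOKEN> is a placeholder.
kubectl -n cattle-system port-forward svc/rancher 8443:443 &
PF_PID=$!
sleep 2
curl -sk -H "Authorization: Bearer <RANCHER-TOKEN>" \
  "https://localhost:8443/k8s/clusters/c-rcjg6/api?timeout=32s"
kill "$PF_PID"   # stop the port-forward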

Other things I tested:

  • Replacing the Rancher EC2 nodes
  • Increasing ALB timeouts
  • Increasing nginx ingress timeouts and requests

Based on my experiments I’d say:

  • It is not related to nginx ingress (port-forwarding, done by connecting to the EKS API directly with kubectl, shows the same issue)
  • In my case it does work from the Rancher kubectl web CLI, which means the tunnel (or whatever that path uses) still works fine
  • It seems to be related to the Rancher service API, which either takes too long or rejects some requests; I’m not sure which

We have also been affected by this since upgrading to Rancher 2.6.5; we were previously running 2.6.2. We can reproduce the issue in the same way by fetching logs from multiple consoles on a downstream cluster. Restarting the Rancher pods on the Rancher cluster and then the cattle-cluster-agent pods on the downstream cluster fixes it. For how long, no idea. A case has been logged with SUSE support.

It’s much more than that when you include all the vendored repos. Not an efficient angle to attack it from. However, we have started to be able to reproduce this in-house thanks to @EC-Sean’s comment. There is still some sporadic behavior, but leaving multiple consoles following a log feed reproduces it at least some of the time. Hope to have a better update soon. We have also started to look at specific commits between 2.6.3 and 2.6.4 that we think may be causing it.

We are having the same issues as well after going from 2.6.3 -> 2.6.4, but we can at least use kubectl from the UI on one of the downstream clusters.

Is there a way for us to monitor or detect the indefinite locks that are mentioned in the release notes for this version, or otherwise detect if this problem is affecting us? That will help us to prioritize our updates for this release.

A major performance issue was occurring when Rancher was attempting to control large volumes of traffic from downstream clusters. This mechanism was not handling disconnects properly and would result in indefinite locks

I’ve also confirmed on 2.6.5 + just the remotedialer patch - issue resolved, thank you!

@medicol69 We were affected by this remotedialer issue for a very long time, with sporadic outages on downstream clusters (mostly under high load on the Kube API). This is fixed in 2.6.6. If you are facing this again, you should verify that all agents are really updated to 2.6.6; otherwise it’s a different incident.

@konih In our case the value of --api-audiences needs to match the one from --service-account-issuer (I am on the same team as @GentianRrafshi).

I guess this only matters if you have --service-account-api-audience set and are running a Kubernetes version affected by the deprecation.

We had --service-account-api-audience set because we were using konnectivity in the past. It would also be interesting to know whether everyone affected by this issue has this flag enabled.

It is the only change we made, and we can’t reproduce the Rancher agent issues anymore, but we are not sure whether it is the real cause. Our Kubernetes clusters are running 1.22.9.

I hope this helps 😃

Hi,

We have fixed this issue in a way that so far no one has mentioned. We had a problem with our Rancher setup: we determined that this command: kubectl get pods -A | awk '{ system ("kubectl logs --all-containers " $2 " -n " $1) }' pretty reliably crashed Rancher with the remotedialer error mentioned above.

After some research (during which we, for example, activated log rotation as mentioned), we found a flag on the API server that seems to have caused this.

We used the API server extra arg --service-account-api-audience before upgrading to Rancher version 2.6.5. That flag is deprecated and is now called --api-audiences.

We missed that, and after some research realized that Rancher ignored --service-account-api-audience and instead set --api-audiences itself, but to unknown. So we removed the flag --service-account-api-audience and added --api-audiences=api (where Rancher itself had set --api-audiences=unknown), and after that everything works fine.
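In case it helps others check their own setup, a quick way to see which audience flags the kube-apiserver actually ended up with (a sketch that assumes a kubeadm-style cluster where the apiserver runs as a static pod in kube-system; RKE users would inspect the kube-apiserver container args on the node instead):

# Print the kube-apiserver command line and filter for the audience-related flags.
kubectl -n kube-system get pods -l component=kube-apiserver \
  -o jsonpath='{range .items[*]}{.spec.containers[0].command}{"\n"}{end}' \
  | tr ',' '\n' | grep -E 'api-audiences|service-account'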

Maybe this causes the error and this is the fix? Could someone verify that?

It’s good to see that the issue will be resolved soon. Is anyone in this thread able to test whether the following PRs resolve their issue? https://github.com/rancher/rancher/pull/38096 https://github.com/rancher/remotedialer/pull/46

Hello all, thanks for the information so far. We are trying to piece together the problem here. So far we have been unable to reproduce this on the Rancher side. Some questions to tie threads together:

  • For those that see this on only some of your clusters, are there any common traits among the affected clusters? This could be things like k8s distro, k8s version, imported vs. provisioned, or cluster usage.
  • Is anyone seeing this issue below k8s 1.21?

yes. 1.20.11

  • Has anyone seen this issue before Rancher 2.6.4?

No. The issue surfaced after the upgrade to 2.6.4

  • Are people seeing high CPU usage on the downstream agent when this happens?

No.

  • Does Rancher have high CPU or memory usage?

No.

  • Does this only happen during high Rancher usage hours? Meaning people updating things via Rancher or getting logs. (Downstream workload activity shouldn’t matter here.)

I think yes. We usually find out because someone is using kubectl and starts having issues.

  • What do affected downstream clusters have in them? How many namespaces, workloads, rolebindings, etc.? We are asking for this info specifically to try to reproduce the bug.

It doesn’t seem to be associated with any particular cluster (we have about 40 clusters).

@cbron We can reproduce this by obtaining logs from any pod with a large amount of logs.

@beatsandpics exactly: kubectl rollout restart -n=cattle-system deployment rancher

BTW, it happened to us again today, so ~4 days since the last restart. We set up monitoring and alerting using kube-prometheus probes to get alerted as soon as it pops up again. This is the URL we continuously check: https://rancher.mycompany.com/k8s/clusters/c-clusterid/api?timeout=32s
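For anyone who wants a quick stand-in for that probe, a minimal shell version of the same check (URL and token are placeholders; the real alerting runs via kube-prometheus blackbox probes):

# Continuously poll the proxied API for one downstream cluster and log when it stops answering.
URL="https://rancher.mycompany.com/k8s/clusters/c-clusterid/api?timeout=32s"
while true; do
  if ! curl -sfk --max-time 35 -H "Authorization: Bearer <RANCHER-TOKEN>" "$URL" > /dev/null; then
    echo "$(date) rancher proxy for c-clusterid is NOT responding" >&2
  fi
  sleep 60
done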

I just found the issue you all created, and believe the issue I commented on/joined (https://github.com/rancher/rancher/issues/37174) is related. Have any of you noticed that the kubectl shell no longer works once you’ve upgraded from Rancher v2.6.2 to v2.6.4? I can’t use the kubectl tool from the menu bar, nor can I do “Execute Shell” to connect to any pods.

The same issue… =/ One of our 3 provisioned clusters has become partly unavailable: the Cluster Summary page, the external link to the Longhorn UI, and the Prometheus UI proxy URL are not working.