rancher: Clusters become unreachable via UI and kubectl after some time
Rancher Server Setup
- Rancher version: 2.6.4
- Installation option (Docker install/Helm Chart): Helm, RKE1, K8S 1.21.10
Information about the Cluster
- Kubernetes version: K8S 1.20.9
- Cluster Type (Local/Downstream): Downstream; the remote cluster was provisioned using a Docker command. We are able to reproduce this on two different clusters.
User Information
- What is the role of the user logged in? Admin
Describe the bug We noticed since the upgrade from 2.5.8 to 2.6.4 that our clusters would become unresponsive via the UI and via kubectl. However, weirdly enough, if we log in via the old UI (removing /dashboard from the URL and adding /login), we can still see everything without issue…
To Reproduce
The fastest way we found to reproduce is to spam the following command at the cluster a few times (10-15 runs, not in parallel); the cluster is quite busy, so we get a bunch of logs doing this:
kubectl logs -n ingress-nginx -f <nginx-ingress-pod>
or kubectl logs -n ingress-nginx <nginx-ingress-pod>
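A rough way to script that spamming (the pod name is a placeholder, the loop count is arbitrary, and the runs are sequential as noted above):

```
# Repeatedly pull the logs of a busy ingress-nginx controller pod.
# The output is discarded locally, but every byte still travels through Rancher's proxy.
for i in $(seq 1 15); do
  kubectl logs -n ingress-nginx <nginx-ingress-pod> > /dev/null
done
```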
Result After a few attempts the command hangs, and from then on any command via kubectl also hangs (either every time, or at first roughly every other attempt). The cluster is still displayed as connected and healthy in the UI.
Expected Result The command shouldn’t hang and make interaction with the cluster impossible. The situation never recovers on its own; we are forced to kill all the Rancher pods to restore the UI and kubectl.
Additional context We are unsure if kubectl logs itself is the cause, or if it simply transfers enough data to block something in Rancher… We are wondering if we are in a case similar to the one below, where the routing that handles forwarding to the cluster is blocked by a full buffer or something similar, but as that ticket is closed, we aren’t sure it is the same: https://github.com/rancher/rancher/issues/34819#issuecomment-1007600901
We also see a lot of the messages mentioned in this one: https://github.com/rancher/rancher/issues/36584, but we aren’t sure if they are the cause of our outages.
Also note that restarting the agent on the downstream cluster, or forcing a full connection interruption between the agents and the Rancher cluster to make them reconnect, doesn’t fix the issue either; only a full kill of the Rancher pods fixes our issue at the moment.
SURE-4484
About this issue
- State: closed
- Created 2 years ago
- Reactions: 36
- Comments: 61 (4 by maintainers)
We’re facing the same problem.
I am running Rancher 2.6.5 on Kubernetes v1.22.9-eks-a64ea69, using nginx-ingress.
I have found a way to reliably reproduce the issue on my server. First I created a pod on one of my clusters that generates a large amount of logs. Next, using a kubeconfig that utilizes Rancher’s auth proxy, I leave a log-follow command running against that pod (sketched below).
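Something along these lines reproduces the setup (the image, pod name, and kubeconfig path are only placeholders):

```
# 1. A throwaway pod that writes log lines in a tight loop (busybox used purely for convenience):
kubectl run log-flood --image=busybox --restart=Never -- \
  /bin/sh -c 'while true; do echo "$(date) filler log line"; done'

# 2. Follow its logs with a kubeconfig that goes through Rancher's auth proxy
#    (i.e. the kubeconfig downloaded from the Rancher UI):
kubectl --kubeconfig ~/.kube/rancher-proxy.yaml logs -f log-flood
```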
One instance of `kubectl logs` would eventually (5m?) make that cluster unreachable through the auth proxy. Running three of those in parallel would kill it in about 1-3 minutes. Restarting Rancher fixes everything. If I connect directly to the cluster instead of going through Rancher server’s proxy, everything works fine. I tried `kubectl attach` through the proxy and could run several instances of that in parallel without experiencing any issues. There must be something different about `kubectl logs` that breaks things.
I can reproduce on Rancher 2.6.4; I can’t connect to imported clusters… (especially one of them which is in a datacenter and much bigger than the others). My Rancher is a 2-node cluster on EKS 1.20.
Some facts (for this imported cluster, which is on kube 1.21 and built with kubeadm)
Other things I tested:
Based on my experiments I’d say
We are also affected by this since the upgrade to Rancher 2.6.5. We were running 2.6.2. We can reproduce the issue in the same way by getting logs from multiple consoles on a downstream cluster. Restarting the Rancher pods on the Rancher cluster and then the cattle-cluster-agent pods on the downstream cluster fixes it. For how long, no idea. A case has been logged with SUSE support.
It’s much more than that when you include all the vendored repos. Not an efficient angle to attack it from. However, we have started to be able to reproduce this in-house thanks to @EC-Sean’s comment. Still some sporadic behavior, but leaving multiple consoles following a logs feed reproduces it at least some of the time. Hope to have a better update soon. We have started to look into specific commits between 2.6.3 and 2.6.4 which we think may be causing it.
Having the same issues as well with regard to going from 2.6.3 -> 2.6.4, but at least one of the downstream clusters can still use kubectl from the UI.
Is there a way for us to monitor or detect the indefinite locks that are mentioned in the release notes for this version, or otherwise detect if this problem is affecting us? That will help us to prioritize our updates for this release.
I’ve also confirmed on 2.6.5 + just the `remotedialer` patch - issue resolved, thank you!
@medicol69 We were affected by this remotedialer thing for a very long time, with sporadic outages on downstream clusters (mostly under high load on the Kube API). This is fixed in 2.6.6. If you are facing this again, you should verify that all agents are really updated to 2.6.6; otherwise it’s another incident.
@konih the value of `--api-audiences` needs to match the one from `--service-account-issuer` in our case (I am in the same team as @GentianRrafshi). This only makes sense if you have `--service-account-api-audience` set and are running a Kubernetes version affected by the deprecation, I guess. We had `--service-account-api-audience` set as we were using Konnectivity in the past. It would also be interesting to know whether everyone affected by this issue has this flag enabled. It is the only change we made, and since then we can’t reproduce the Rancher agent issues anymore, but we are not sure if this is the real cause. Our Kubernetes clusters are running 1.22.9.
I hope this helps 😃
Hi,
we have fixed this issue in a way that, so far, no one has mentioned. We had a problem with our Rancher setup. We determined that this command:
kubectl get pods -A | awk '{ system ("kubectl logs --all-containers " $2 " -n " $1) }'
pretty reliably crashed Rancher with the remotedialer error mentioned above. After some research, during which we for example activated log rotation as mentioned, we found a flag on the API server that seems to have caused this.
We used the API extraArg `--service-account-api-audience` before upgrading to Rancher version 2.6.5. That flag is deprecated and is now called `--api-audiences`. We missed that, and after some research realized that Rancher ignored `--service-account-api-audience` and set `--api-audiences` itself, but to unknown. So we removed the flag `--service-account-api-audience` and added `--api-audiences=api` (where Rancher itself had set `--api-audiences=unknown`), and after that everything works fine.
Maybe this causes the error and this is the fix? Could someone verify that?
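For anyone checking their own setup, the flag change boils down to something like this on the kube-apiserver command line. How it is actually configured depends on how the cluster was built (kubeadm, RKE extra args, etc.); the old value and the surrounding flags are elided here:

```
# Before: the deprecated flag we had set (Rancher reportedly ignored it,
# effectively leaving the API server with --api-audiences=unknown):
kube-apiserver ... --service-account-api-audience=<old-value> ...

# After: the deprecated flag removed, its successor set explicitly:
kube-apiserver ... --api-audiences=api ...
```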
It’s good to see the issue will be resolved soon, is anyone in this thread able to test if the following PRs resolve their issue? https://github.com/rancher/rancher/pull/38096 https://github.com/rancher/remotedialer/pull/46
yes. 1.20.11
No. The issue surfaced after the upgrade to 2.6.4
No.
No.
I think yes. Usually find out because someone is using kubectl and starts having issues
It doesn’t seem to be associated with any particular cluster (we have about 40 clusters)
@cbron We can reproduce this by obtaining logs from any pod with a large amount of logs.
@beatsandpics exactly:
kubectl rollout restart -n=cattle-system deployment rancher
BTW it happened to us today, so ~4 days since the last restart. We set up monitoring and alerting using kube-prometheus probes to get alerted as soon as it pops up again. This is the URL we continuously check: https://rancher.mycompany.com/k8s/clusters/c-clusterid/api?timeout=32s
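For anyone not running kube-prometheus, a rough standalone equivalent is just an authenticated request against that same endpoint (the URL and token below are placeholders):

```
#!/bin/sh
# Exit non-zero when the Rancher proxy to the downstream cluster stops answering.
URL="https://rancher.mycompany.com/k8s/clusters/c-clusterid/api?timeout=32s"
TOKEN="token-xxxxx:yyyyyyyy"   # Rancher API token with access to that cluster (placeholder)

code=$(curl -sk -o /dev/null --max-time 35 -w '%{http_code}' \
  -H "Authorization: Bearer ${TOKEN}" "$URL")

if [ "$code" != "200" ]; then
  echo "Rancher proxy check failed (HTTP status: ${code})" >&2
  exit 1
fi
```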
I just found the issue you all created, and believe the issue I commented on/joined (https://github.com/rancher/rancher/issues/37174) is related. Do any of you notice that kubectl shell is no longer working once you’ve upgraded from Rancher v2.6.2 to v2.6.4? Can’t use the kubectl tool from the menu bar, nor can you do “execute shell” to connect to any pods.
The same issue… =/ One of the 3 provisioned clusters becomes partly unavailable: the Cluster Summary page is not working, nor are the external links to the Longhorn UI and to the Prometheus UI via the proxy URL.