rancher: Cluster explorer and websocket connection broken after upgrading to rancher 2.5.8 (fine with 2.5.7)
SURE-3499 SURE-3469
What kind of request is this (question/bug/enhancement/feature request): Bug/Regression
Steps to reproduce (least amount of steps as possible):
- Upgrade from 2.5.7 to 2.5.8
- Go to cluster explorer of Rancher local cluster. It’s all fine
- Go to cluster explorer of downstream cluster: for a split second I can see the dashboard, and then I get logged out and redirected to the login page
It seems that the websocket connection is not established and that’s when the redirect happens.
RANCHER 2.5.8
Right after upgrading, the rancher pod in the rancher upstream cluster logs the following each time I try to access the cluster explorer:
[ERROR] Error during subscribe websocket: close sent
In the downstream cluster, cattle node agents report:
time="2021-05-28T09:10:12Z" level=info msg="Starting plan monitor, checking every 120 seconds"
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"
and the cattle cluster agent:
time="2021-05-28T09:40:14Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:42:10Z" level=error msg="Error during subscribe websocket: close sent"
W0528 09:42:48.617883 39 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
time="2021-05-28T09:43:58Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"
W0528 09:47:25.012488 39 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-05-28T09:50:30Z" level=error msg="Error during subscribe websocket: close sent"
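To compare agent logs between the two versions, it can help to tally the error-level messages rather than read them line by line. The following is a minimal sketch, assuming the logs use the key=value text format shown above; the function names `parse_logrus_line` and `count_errors` are hypothetical helpers, not part of Rancher. Lines that do not follow that format (such as the `W0528 ...` deprecation warnings) are ignored.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is a quoted string or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logrus_line(line):
    """Return the key=value fields of one log line, or {} if none are found."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields


def count_errors(lines):
    """Count messages logged at level=error, keyed by message text."""
    counts = Counter()
    for line in lines:
        fields = parse_logrus_line(line)
        if fields.get("level") == "error":
            counts[fields.get("msg", "")] += 1
    return counts


# Sample lines copied from the cattle cluster agent output in this report.
sample = [
    'time="2021-05-28T09:40:14Z" level=error msg="Error during subscribe websocket: close sent"',
    'time="2021-05-28T09:42:10Z" level=error msg="Error during subscribe websocket: close sent"',
    'W0528 09:42:48.617883 39 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated',
    'time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"',
    'time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"',
]

for message, count in count_errors(sample).most_common():
    print(count, message)
```

Feeding it the saved output of the agent pods from a 2.5.7 and a 2.5.8 run would show which error messages only appear after the upgrade.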
RANCHER 2.5.7
Upstream and downstream are working fine. Rancher UI working fine too
When the cattle cluster agent starts, it logs no errors:
time="2021-05-28T09:53:52Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect/register with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:53:52Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect/register"
time="2021-05-28T09:53:52Z" level=info msg="Starting user controllers"
When each cattle node agent pod starts, it logs something like:
time="2021-05-28T09:53:52Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect/register with token xxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:53:52Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect/register"
time="2021-05-28T09:53:52Z" level=info msg="Starting user controllers"
Neither of them logs any errors
Result:
I have to roll back to 2.5.7 if I want to carry on using rancher
Other details that may be helpful:
Environment information
- Rancher version (rancher/rancher / rancher/server image tag or shown bottom left in the UI): 2.5.8
- Installation option (single install/HA): single install with helm
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Hosted
- Machine type (cloud/VM/metal) and specifications (CPU/memory): Cloud
- Kubernetes version (use kubectl version): v1.20.7+rke2r2
- Docker version (use docker version): 20.10.3
About this issue
- State: closed
- Created 3 years ago
- Reactions: 3
- Comments: 61 (8 by maintainers)
The issue still occurs on v2.6.4.
Same here on v2.6.5.
I have the same issue when upgrading from Rancher 2.5 to 2.7.5. I deleted all cattle namespaces on EKS and applied the YAML again, but still encountered this problem.
@maradwan same here
I have opened a new issue to raise awareness of this, since this issue is closed and we are still dealing with the problem. https://github.com/rancher/rancher/issues/38931
Please feel free to add comments to the new issue if you have the same or similar problems with the websockets, so it hopefully gets more priority. Thnx!
I am also having this issue in 2.6.7
i think i found it
did you guys perhaps disable anonymous auth in grafana?
obviously rancher explorer gets the metrics from rancher-monitoring grafana, so it will get a 401 from grafana when loading the dashboard, and that breaks the websocket
edit: if you need to expose grafana and that's why you have anonymous auth disabled, the only workaround i can think of is to create a dummy org with no datasources and set
You won't have metric graphs in your explorer, but it will work. A fix would be for explorer/websocket to ignore the 401 from grafana and just not load the grafana iframes if a 401 is received.
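One way to test the theory above is to request the Grafana URL without credentials and see whether it answers with a 401. This is a minimal sketch using only the Python standard library; the function name `anonymous_access_denied` is hypothetical, and the URL to pass in is whatever address your Grafana instance is reachable at (this thread does not name one). It only detects a literal HTTP 401; a redirect to a login page would be reported as not denied.

```python
import urllib.error
import urllib.request


def anonymous_access_denied(url, timeout=5.0):
    """Return True if an unauthenticated GET to `url` is answered with HTTP 401."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            # A successful response (after any redirects) means the
            # anonymous request was not rejected with a 401.
            return False
    except urllib.error.HTTPError as exc:
        # Other error codes (403, 404, 5xx) are not treated as "denied" here.
        return exc.code == 401
```

If this returns True for the Grafana address used by the affected cluster, that is consistent with the explanation in this comment, though it does not prove that the 401 is what closes the websocket.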
We are seeing this on 1.19.x downstream clusters after upgrading to rancher 2.5.9. All 1.20.x clusters are fine here. As soon as I switch to a 1.19.x cluster in explorer, it kicks me to the login prompt.
edit1: it is obviously an issue with the “Cluster Dashboard”, I mean the initial landing page of a downstream cluster. If I open e.g. https://rancher.blah/dashboard/c/cluster-id/explorer/namespace directly, it works normally; as soon as I switch to “Cluster Dashboard”, it kicks me out.
Tomorrow I will upgrade some of our downstream clusters to 1.20.x, let's see if something changes.
For us the situation was caused by unusually high load on rancher, which was resolved through #38804. This is probably only relevant if you’re using ActiveDirectory as authentication provider.
Having problems on stable version: 2.6.8
We are also facing this issue on 2.6.8
@decimalator This helped. I had renamed the grafana org to something else. When I changed it back to the default, Main Org., it started working again. Don’t know how they are related, but it’s fixed.
In my case, I installed it with the bash command:
curl -sfL https://get.rke2.io | sh -
However, I upgraded it using the Rancher cluster edit tool.