rancher: Cluster explorer and websocket connection broken after upgrading to rancher 2.5.8 (fine with 2.5.7)

SURE-3499 SURE-3469

What kind of request is this (question/bug/enhancement/feature request): Bug/Regression

Steps to reproduce (least amount of steps as possible):

Upgrade from 2.5.7 to 2.5.8
Go to the cluster explorer of the Rancher local cluster. It’s all fine.
Go to the cluster explorer of a downstream cluster: the dashboard flashes up for about a second, then I get logged out and redirected to the login page.

It seems that the websocket connection is not established and that’s when the redirect happens.

RANCHER 2.5.8

Right after upgrading, the Rancher pod in the upstream cluster logs the following each time I try to access the cluster explorer:

[ERROR] Error during subscribe websocket: close sent

In the downstream cluster, cattle node agents report:

time="2021-05-28T09:10:12Z" level=info msg="Starting plan monitor, checking every 120 seconds"
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"

and the cattle cluster agent:

time="2021-05-28T09:40:14Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:42:10Z" level=error msg="Error during subscribe websocket: close sent"
W0528 09:42:48.617883 39 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
time="2021-05-28T09:43:58Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"
W0528 09:47:25.012488 39 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-05-28T09:50:30Z" level=error msg="Error during subscribe websocket: close sent"
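
To see at a glance which errors dominate an agent log like the one above, the lines can be tallied by message. This is only a minimal sketch, not a Rancher tool: `parse_line`, `count_errors` and `SAMPLE` are hypothetical names, and it assumes the `key="value"` layout visible in these lines. The klog-style `W0528 ...` warnings carry no such pairs and are skipped.

```python
import re
from collections import Counter

# key="quoted value" or key=bare, as in the logrus-style agent lines above.
PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')


def parse_line(line):
    """Return the key/value pairs on one log line ({} if it has none)."""
    fields = {}
    for match in PAIR.finditer(line):
        key, quoted, bare = match.groups()
        fields[key] = quoted if quoted is not None else bare
    return fields


def count_errors(lines):
    """Count level=error messages, keyed by their msg field."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") == "error":
            counts[fields.get("msg", "")] += 1
    return counts


# Sample lines copied from the cluster agent output in this report.
SAMPLE = [
    'time="2021-05-28T09:40:14Z" level=error msg="Error during subscribe websocket: close sent"',
    'time="2021-05-28T09:42:10Z" level=error msg="Error during subscribe websocket: close sent"',
    'W0528 09:42:48.617883 39 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition',
    'time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"',
]

if __name__ == "__main__":
    for msg, n in count_errors(SAMPLE).most_common():
        print(f"{n}  {msg}")
```

Feeding it the saved output of both agents makes it easier to compare how often the subscribe error occurs before and after the upgrade.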

RANCHER 2.5.7

Upstream and downstream are working fine. Rancher UI working fine too

When the cattle cluster agent starts it logs no errors

time="2021-05-28T09:53:52Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect/register with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:53:52Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect/register"
time="2021-05-28T09:53:52Z" level=info msg="Starting user controllers"

When each cattle node agent pod is started they log something like:

time="2021-05-28T09:53:52Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect/register with token xxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:53:52Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect/register"
time="2021-05-28T09:53:52Z" level=info msg="Starting user controllers"

Neither of them logs any errors

Result:

I have to roll back to 2.5.7 if I want to carry on using Rancher.

Other details that may be helpful:

Environment information

Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.5.8
Installation option (single install/HA): single install with helm

Cluster information

Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Hosted
Machine type (cloud/VM/metal) and specifications (CPU/memory): Cloud
Kubernetes version (use kubectl version): v1.20.7+rke2r2
Docker version (use docker version): 20.10.3

About this issue

State: closed
Created 3 years ago
Reactions: 3
Comments: 61 (8 by maintainers)

Most upvoted comments

The issue is still present on v2.6.4

+15

maradwan on Apr 14, 2022

same here on v2.6.5

Terry-Basin on May 23, 2022

I have the same issue when upgrading from Rancher 2.5 to 2.7.5. I deleted all cattle namespaces on EKS and applied the YAML again, but still encountered this problem.

zhang8473 on Jul 18, 2023

@maradwan same here

sh0umik on May 12, 2022

I have opened a new issue to raise awareness for this, since this issue is closed and we still deal with the problem. https://github.com/rancher/rancher/issues/38931

Please feel free to add comments to the new issue if you have the same or similar problems with the websockets, so it hopefully gets more priority. Thanks!

rjbaat on Sep 15, 2022

I also have wss connection issue notifications popping up on 2.6.7.

The message is:

The connection to wss://my.rancher.host/v1/subscribe closed unexpectedly Thu, Aug 25 2022 11:17:06 am. Retrying...
Websocket Disconnected
The connection to wss://my.rancher.host/v1/subscribe closed unexpectedly

And in the logs it tells me:

Error during subscribe websocket: close sent
Error during subscribe write tcp 172.17.0.3:443->xxx.xxx.xxx.xxx:52499: write: broken pipe

Is there a solution?

I am also having this issue in 2.6.7

BMeach on Aug 30, 2022

i think i found it

did you guys perhaps disable anonymous auth in grafana?

  grafana.ini:
    auth:
      disable_login_form: false
    auth.anonymous:
      enabled: false
      org_role: Viewer

obviously Rancher explorer gets the metrics from the rancher-monitoring Grafana, so it gets a 401 from Grafana when loading the dashboard, and that tears down the websocket

edit: if you need to expose Grafana and that's why you have anonymous auth disabled, the only workaround I can think of is to create a dummy org with no datasources and set

  grafana.ini:
    auth:
      disable_login_form: false
    auth.anonymous:
      enabled: true
      org_role: Viewer
      org_name: dummy

You won't have metric graphs in your explorer, but it will work.

A fix would be for explorer/websocket to ignore the 401 from Grafana and simply not load the Grafana iframes if a 401 is received.
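
The fix suggested above (skip the Grafana iframes when Grafana answers 401, instead of letting that response affect the session) could look roughly like the sketch below. This is not Rancher's dashboard code. `embeddable_panels`, `fetch_status` and the panel URL are hypothetical names used only for illustration, and the HTTP probe is passed in as a function so the logic stays independent of any particular client.

```python
from typing import Callable, Iterable, List


def embeddable_panels(panel_urls: Iterable[str],
                      fetch_status: Callable[[str], int]) -> List[str]:
    """Return the panel URLs that answer 2xx; silently skip the rest.

    A 401 from Grafana only drops that panel: it is never raised or
    propagated, so it cannot affect the caller's session.
    """
    usable = []
    for url in panel_urls:
        try:
            status = fetch_status(url)
        except OSError:
            # An unreachable Grafana is treated like a rejected panel.
            continue
        if 200 <= status < 300:
            usable.append(url)
    return usable


if __name__ == "__main__":
    # Stand-in for a real HTTP probe: every panel is rejected with 401.
    print(embeddable_panels(["/grafana/d/cluster"], lambda url: 401))
```

With anonymous auth disabled, every probe would return 401 and the list comes back empty, so the page would render without metric graphs rather than logging the user out.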

flostru on Aug 13, 2021

We are seeing this on 1.19.x downstream clusters after upgrading to Rancher 2.5.9; all 1.20.x clusters are fine here. As soon as I switch to a 1.19.x cluster in explorer, it kicks me to the login prompt.

edit1: it is obviously an issue with the “Cluster Dashboard”, I mean the initial landing page of a downstream cluster. If I call e.g. https://rancher.blah/dashboard/c/cluster-id/explorer/namespace directly, it works normally; as soon as I switch to “Cluster Dashboard”, it kicks me out of there.

Tomorrow I will upgrade some of our downstream clusters to 1.20.x; let's see if something changes.

flostru on Aug 11, 2021

For us the situation was caused by unusually high load on Rancher, which was resolved through #38804. This is probably only relevant if you’re using ActiveDirectory as authentication provider.

franznemeth on Sep 29, 2022

Having problems on stable version: 2.6.8

gabrielcandrade on Sep 10, 2022

We are also facing this issue on 2.6.8

franznemeth on Sep 2, 2022

@decimalator This helped. I had renamed the Grafana org to something else. When I changed it back to the default of Main Org. it started working again. Don’t know how they are related, but it’s fixed.

aggiering on Aug 26, 2021

In my case, I installed it with the bash command curl -sfL https://get.rke2.io | sh -. However, I upgraded it using the Rancher cluster edit tool

cortopy on Jul 30, 2021