rancher: Cluster explorer and websocket connection broken after upgrading to rancher 2.5.8 (fine with 2.5.7)

SURE-3499 SURE-3469

What kind of request is this (question/bug/enhancement/feature request): Bug/Regression

Steps to reproduce (least amount of steps as possible):

  1. Upgrade from 2.5.7 to 2.5.8
  2. Go to cluster explorer of Rancher local cluster. It’s all fine
  3. Go to cluster explorer of a downstream cluster: the dashboard flashes on screen for a second, and then I get logged out and redirected to the login page

It seems that the websocket connection is not established and that’s when the redirect happens.

RANCHER 2.5.8

Right after upgrading, the rancher pod in the upstream Rancher cluster logs the following each time I try to access the cluster explorer:

[ERROR] Error during subscribe websocket: close sent

In the downstream cluster, cattle node agents report:

time="2021-05-28T09:10:12Z" level=info msg="Starting plan monitor, checking every 120 seconds"
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"

and the cattle cluster agent:

time="2021-05-28T09:40:14Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:42:10Z" level=error msg="Error during subscribe websocket: close sent"
W0528 09:42:48.617883 39 warnings.go:80] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
time="2021-05-28T09:43:58Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"
W0528 09:47:25.012488 39 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-05-28T09:50:30Z" level=error msg="Error during subscribe websocket: close sent"
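For anyone comparing agent logs before and after the upgrade, here is a minimal Python sketch that tallies the error messages and pulls out the websocket close code. It assumes the `time=... level=... msg=...` key/value format shown above; the names `parse_line` and `summarize` are made up for this example and are not part of Rancher. As far as I know, close code 1006 is reserved in RFC 6455 for a connection that dropped without a close frame, which would point at something cutting the connection rather than a clean shutdown.

```python
import re
from collections import Counter

# key=value pairs as they appear in the agent logs above; values may be quoted.
PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')
CLOSE_CODE = re.compile(r"websocket: close (\d+)")

# A few lines copied from the cluster agent log in this report.
SAMPLE = '''
time="2021-05-28T09:40:14Z" level=error msg="Error during subscribe websocket: close sent"
time="2021-05-28T09:42:10Z" level=error msg="Error during subscribe websocket: close sent"
W0528 09:47:25.012488 39 warnings.go:80] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-05-28T09:45:25Z" level=error msg="Remotedialer proxy error" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2021-05-28T09:45:35Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect"
'''


def parse_line(line):
    """Return the key/value fields of one log line, or None for other formats."""
    if not line.startswith("time="):
        return None  # e.g. the W0528 deprecation warnings
    fields = {}
    for key, quoted, bare in PAIR.findall(line):
        fields[key] = bare if bare else quoted
    return fields


def summarize(lines):
    """Count error messages and websocket close codes across the given lines."""
    errors = Counter()
    close_codes = Counter()
    for line in lines:
        fields = parse_line(line.strip())
        if not fields or fields.get("level") != "error":
            continue
        errors[fields.get("msg", "")] += 1
        match = CLOSE_CODE.search(fields.get("error", ""))
        if match:
            close_codes[int(match.group(1))] += 1
    return errors, close_codes


if __name__ == "__main__":
    errors, close_codes = summarize(SAMPLE.splitlines())
    for message, count in errors.most_common():
        print(count, message)
    print("close codes:", dict(close_codes))
```

Running the same tally against logs captured on 2.5.7 and on 2.5.8 gives a quick side-by-side of which errors only show up after the upgrade.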

RANCHER 2.5.7

Upstream and downstream are working fine. Rancher UI working fine too

When the cattle cluster agent starts it logs no errors

time="2021-05-28T09:53:52Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect/register with token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:53:52Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect/register"
time="2021-05-28T09:53:52Z" level=info msg="Starting user controllers"

When each cattle node agent pod starts, it logs something like:

time="2021-05-28T09:53:52Z" level=info msg="Connecting to wss://my.rancher.host/v3/connect/register with token xxxxxxxxxxxxxxxxxxxxxxxxxxx"
time="2021-05-28T09:53:52Z" level=info msg="Connecting to proxy" url="wss://my.rancher.host/v3/connect/register"
time="2021-05-28T09:53:52Z" level=info msg="Starting user controllers"

Neither of them logs any errors

Result:

I have to roll back to 2.5.7 if I want to carry on using Rancher

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.5.8
  • Installation option (single install/HA): single install with helm

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Hosted

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): Cloud

  • Kubernetes version (use kubectl version): v1.20.7+rke2r2

  • Docker version (use docker version): 20.10.3

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 61 (8 by maintainers)

Most upvoted comments

The issue still occurs on v2.6.4

same here on v2.6.5

I have the same issue when upgrading from Rancher 2.5 to 2.7.5. I deleted all cattle namespaces on EKS and applied the YAML again, but still encountered this problem.

@maradwan same here

I have made a new issue to raise awareness for this, since this issue is closed and we still deal with the problem. https://github.com/rancher/rancher/issues/38931

Please feel free to add comments to the new issue if you have the same or similar problems with the websockets, so it hopefully gets more priority. Thanks!

I also have wss connection issue notifications popping up on 2.6.7.

The message is:

The connection to wss://my.rancher.host/v1/subscribe closed unexpectedly Thu, Aug 25 2022 11:17:06 am. Retrying...
Websocket Disconnected
The connection to wss://my.rancher.host/v1/subscribe closed unexpectedly 

And in the logs it tells me:

Error during subscribe websocket: close sent
Error during subscribe write tcp 172.17.0.3:443->xxx.xxx.xxx.xxx:52499: write: broken pipe

Is there a solution?

I am also having this issue in 2.6.7

I think I found it

Did you perhaps disable anonymous auth in Grafana?

  grafana.ini:
    auth:
      disable_login_form: false
    auth.anonymous:
      enabled: false
      org_role: Viewer

Obviously, the Rancher explorer gets its metrics from the rancher-monitoring Grafana, so it gets a 401 from Grafana when loading the dashboard, and that tears down the websocket.

Edit: if you need to expose Grafana and that’s why you have anonymous auth disabled, the only workaround I can think of is to create a dummy org with no datasources and set:

  grafana.ini:
    auth:
      disable_login_form: false
    auth.anonymous:
      enabled: true
      org_role: Viewer
      org_name: dummy

You won’t have metric graphs in your explorer, but it will work. A fix would be for the explorer/websocket to ignore the 401 from Grafana and simply not load the Grafana iframes when a 401 is received.
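To check whether you are hitting this case, a rough Python sketch like the one below sends an unauthenticated request to Grafana and reports whether it is rejected. The helper names (`fetch_status`, `anonymous_access`, `check`) are invented for illustration, and the URL you pass in is an assumption: it should be a Grafana API path that needs at least viewer access (I believe `/api/search` is one), reached however your setup exposes Grafana. A plain UI URL is less useful here, because Grafana may redirect it to the login page instead of answering 401.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Send an unauthenticated GET and return the HTTP status code."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is what we want here.
        return err.code


def anonymous_access(status):
    """Map an HTTP status code to a rough verdict on anonymous access."""
    if status in (401, 403):
        return "denied"
    if 200 <= status < 400:
        return "allowed"
    return "unknown"


def check(url, fetch=fetch_status):
    """Report whether an unauthenticated request to `url` is accepted.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    return anonymous_access(fetch(url))
```

For example, `check("https://grafana.example.invalid/api/search")` (a placeholder host) should return "denied" when anonymous auth is off; if that matches a cluster where the explorer logs you out, the workaround above is worth trying.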

We are seeing this on 1.19.x downstream clusters after upgrading to Rancher 2.5.9; all 1.20.x clusters are fine here. As soon as I switch to a 1.19.x cluster in the explorer, it kicks me to the login prompt.

Edit 1: it is obviously an issue with the “Cluster Dashboard”, I mean the initial landing page of a downstream cluster. If I open e.g. https://rancher.blah/dashboard/c/cluster-id/explorer/namespace directly, it works normally; as soon as I switch to “Cluster Dashboard”, it kicks me out.

Tomorrow I will upgrade some of our downstream clusters to 1.20.x; let’s see if something changes.

For us the situation was caused by unusually high load on Rancher, which was resolved through #38804. This is probably only relevant if you’re using ActiveDirectory as the authentication provider.

Having problems on stable version: 2.6.8

We are also facing this issue on 2.6.8

@decimalator This helped. I had renamed the grafana org to something else. When I changed back to default of Main Org. it started working again. Don’t know how they are related - but it’s fixed.

In my case, I installed it with the bash command curl -sfL https://get.rke2.io | sh -. However, I upgraded it using the Rancher cluster edit tool