rancher: [2.4.3] Frequent backed up reader errors when using Imported cluster (EKS)

What kind of request is this: bug

Steps to reproduce:

  1. Import an EKS cluster
  2. Keep using it for some time (view logs, execute shell, switch projects, etc.)

Result:

  1. The page gets stuck and an error notification comes up:
Get https://10.100.0.1:443/api/v1/namespaces?timeout=30s: backed up reader

Environment information

  • Rancher version: v2.4.2 and v2.4.3
  • UI version: v2.4.14 and v2.4.22
  • Installation option: single install

Cluster information

  • Cluster type: Imported
  • Machine type: AWS EKS + 3 Linux Nodes + 2 Windows Nodes
  • Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

gzrancher/rancher#10302

gzrancher/rancher#10607

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 19
  • Comments: 53 (16 by maintainers)

Most upvoted comments

We are seeing lots of UI timeout errors after upgrading to 2.4.2. Same situation as above: imported EKS clusters running on 1.15.

Alright, new findings:

The library responsible for the “backed up reader” errors is https://github.com/rancher/remotedialer. The way I understand it, it is responsible for tunnelling the calls to the kube-apiserver (like kubectl proxy; correct me if I’m wrong). Its default timeout is 15 seconds (connection.go), but it can be overridden by setting the REMOTEDIALER_BACKUP_TIMEOUT_SECONDS environment variable on the Rancher Docker container.
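For anyone who wants to try the same experiment, a minimal sketch of passing that variable on a single-node Docker install (the image tag, ports, and any volume mounts are illustrative and should be adapted to your own setup; 60 seconds is an arbitrary example value):

# Recreate the single-node Rancher container with a longer remotedialer backup timeout.
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -e REMOTEDIALER_BACKUP_TIMEOUT_SECONDS=60 \
  rancher/rancher:v2.4.3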

Having done that, it seems like this only masks the actual issue. I’m now getting errors like these:

rancher_1  | 2020/05/22 08:49:12 [INFO] Failed to watch system component, error: Get https://100.64.0.1:443/api/v1/componentstatuses?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I’m also getting these when using the web UI. The EOF errors I had previously also persist (in that case, the only fix was still to restart Rancher).

Anyway, these timeouts look to me like the kube-apiserver is either responding slowly or not returning anything at all because something crashes. Looking at the kube-apiserver logs, it seems like it might be the latter:

I0522 08:43:01.501391       1 log.go:172] http2: panic serving 172.20.40.252:41952: runtime error: invalid memory address or nil pointer dereference
goroutine 254935293 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc023d96660)
        /workspace/anago-v1.16.9-beta.0.49+25599b5adea930/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:108 +0x107
panic(0x3bed560, 0x73b5700)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2

This stack trace goes on for some time.

While this is not optimal, a Google search suggests that this panic is merely a symptom as well, specifically of the HTTP handler timing out. Someone hit it due to etcd TCP socket exhaustion on their machine, but that doesn’t seem to be the case for me.

Still, I will look into possible etcd performance issues. Will keep this issue updated.

After speaking to @ibuildthecloud, it turns out that the issue is related to our transition to HTTP/2; we will put in a revert. A quick fix could be to restart Rancher with the environment variable DISABLE_HTTP2=true, as in the sketch below.
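For anyone applying that quick fix before a patched release is out, a minimal sketch on a single-node Docker install (the image tag, ports, and container name are illustrative; keep whatever data volumes your existing install already uses):

# Stop and remove the old container first, then start Rancher with HTTP/2 disabled.
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -e DISABLE_HTTP2=true \
  rancher/rancher:v2.4.3

# Optional sanity check that the variable is set inside the container
# ("rancher" is an assumed container name).
docker exec rancher env | grep DISABLE_HTTP2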

We are also seeing this in imported GKE clusters, on Rancher v2.4.4.

I have the same problem; my imported k8s cluster version is 1.14.9.

Hit the issue on an upgrade from 2.3.8 to 2.4.4 and 2.4.5-rc2

Steps:

  • Have an imported EKS cluster in rancher:v2.3.8 (k8s 1.16, 8 worker nodes)
  • Deploy redis apps - filled up 123/136 pods
  • Upgrade to 2.4.4/2.4.5-rc2
  • Cluster goes into Unavailable state with error - Cluster health check failed: Failed to communicate with API server: Get https://10.100.0.1:443/api/v1/namespaces/kube-system?timeout=30s: EOF

Just upgraded to v2.4.5. The issue is fixed 🎉 Thank you rancher team 😄

Same here with an imported RKE-created cluster. The workaround of disabling HTTP/2 works for this scenario as well.

@rmweir I created ~10 screenshots of the page call first erroring out and then taking ~1 minute to succeed, but eventually succeeding. They’re pretty large and it would be a lot to censor, so could I send them to you via mail or the Rancher Users Slack or something?

I am also experiencing this issue. Rancher 2.4.3

We also had to roll back. We can’t use 2.4 until this is resolved.