rancher: [2.4.3] Frequent backed up reader errors when using Imported cluster (EKS)

What kind of request is this: bug

Steps to reproduce:

  1. Import an EKS cluster
  2. Keep using it for some time (view logs, execute shell, switch projects, etc.)

Result:

  1. The page gets stuck and an error notification comes up:
Get https://10.100.0.1:443/api/v1/namespaces?timeout=30s: backed up reader

Environment information

  • Rancher version: v2.4.2 and v2.4.3
  • UI version: v2.4.14 and v2.4.22
  • Installation option: single install

Cluster information

  • Cluster type: Imported
  • Machine type: AWS EKS + 3 Linux Nodes + 2 Windows Nodes
  • Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

gzrancher/rancher#10302

gzrancher/rancher#10607

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 19
  • Comments: 53 (16 by maintainers)

Most upvoted comments

We are seeing lots of UI timeout errors after upgrading to 2.4.2. Same situation as above: imported EKS clusters running on 1.15.

Alright, new findings:

The library responsible for the “backed up reader” errors is https://github.com/rancher/remotedialer. The way I understand it, it is responsible for tunnelling the calls to the kube-apiserver (like kubectl proxy; correct me if I’m wrong). Its default timeout is 15 seconds (connection.go), but it can be overridden by setting the REMOTEDIALER_BACKUP_TIMEOUT_SECONDS environment variable on the Rancher Docker container.
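For anyone who wants to try the same experiment, a minimal sketch of passing that variable on a single-node Docker install (the image tag, ports, and any volume mounts are illustrative and should be adapted to your own setup; 60 seconds is an arbitrary example value):

# Recreate the single-node Rancher container with a longer remotedialer backup timeout.
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -e REMOTEDIALER_BACKUP_TIMEOUT_SECONDS=60 \
  rancher/rancher:v2.4.3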

Having done that, it seems like this only masks the actual issue. I’m now getting errors like these:

rancher_1  | 2020/05/22 08:49:12 [INFO] Failed to watch system component, error: Get https://100.64.0.1:443/api/v1/componentstatuses?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I’m also getting these when using the web UI. The EOF errors I had previously also persist (in that case, the only fix was still to restart Rancher).

Anyway, these timeouts look to me like the kube-apiserver is either responding slowly or not returning anything at all because something crashes. Looking at the kube-apiserver logs, it seems like it might be the latter:

I0522 08:43:01.501391       1 log.go:172] http2: panic serving 172.20.40.252:41952: runtime error: invalid memory address or nil pointer dereference
goroutine 254935293 [running]:
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc023d96660)
        /workspace/anago-v1.16.9-beta.0.49+25599b5adea930/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:108 +0x107
panic(0x3bed560, 0x73b5700)
        /usr/local/go/src/runtime/panic.go:679 +0x1b2

This stack trace goes on for some time.

While this is not optimal, a Google search suggests that this panic is merely a symptom as well, specifically of the HTTP handler timing out. Someone hit it due to etcd TCP socket exhaustion on their machine, but that doesn’t seem to be the case for me.

Still, I will look into possible etcd performance issues. Will keep this issue updated.

After speaking to @ibuildthecloud, it turns out that the issue is related to our transition to HTTP/2; we will put in a revert. A quick fix could be to restart Rancher with the environment variable DISABLE_HTTP2=true, as in the sketch below.
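For anyone applying that quick fix before a patched release is out, a minimal sketch on a single-node Docker install (the image tag, ports, and container name are illustrative; keep whatever data volumes your existing install already uses):

# Stop and remove the old container first, then start Rancher with HTTP/2 disabled.
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -e DISABLE_HTTP2=true \
  rancher/rancher:v2.4.3

# Optional sanity check that the variable is set inside the container
# ("rancher" is an assumed container name).
docker exec rancher env | grep DISABLE_HTTP2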

We are also seeing this in imported GKE clusters, on Rancher v2.4.4.

I have the same problem; my imported k8s cluster version is 1.14.9.

Hit the issue on an upgrade from 2.3.8 to 2.4.4 and 2.4.5-rc2

Steps:

  • Have an imported EKS cluster in rancher:v2.3.8 (k8s 1.16, 8 worker nodes)
  • Deploy redis apps - filled up 123/136 pods
  • Upgrade to 2.4.4/2.4.5-rc2
  • Cluster goes into Unavailable state with error - Cluster health check failed: Failed to communicate with API server: Get https://10.100.0.1:443/api/v1/namespaces/kube-system?timeout=30s: EOF

Just upgraded to v2.4.5. The issue is fixed 🎉 Thank you rancher team 😄

Same here with an imported RKE-created cluster. The workaround of disabling HTTP/2 works for this scenario as well.

@rmweir I created ~10 screenshots of the page call first erroring out and then taking ~1 minute to succeed, but eventually succeeding. They’re pretty large and it would be a lot to censor, so could I send them to you via mail or the Rancher Users Slack or something?

I am also experiencing this issue. Rancher 2.4.3

We also had to roll back. We can’t use 2.4 until this is resolved.