rancher: Cluster goes to unavailable because failed to communicate with API server: i/o timeout

What kind of request is this (question/bug/enhancement/feature request): Question/Bug

Steps to reproduce (least amount of steps as possible): It happens randomly.

Result: I randomly get the message below from Rancher and lose connectivity with the k8s cluster: Cluster health check failed: Failed to communicate with API server: Get "https://X.X.X.X:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp X.X.X.X:443->X.X.X.X:58260: i/o timeout

Other details that may be helpful: Inside the Rancher node, I see the error below: 2021/01/22 20:54:33 [ERROR] error syncing 'u-isxe45ee6z': handler cat-user-attribute-controller: Put "https://X.X.X.X:6443/apis/cluster.cattle.io/v3/namespaces/cattle-system/clusteruserattributes/u-isxe45ee6z": write tcp X.X.X.X:443->X.X.X.X:44508: i/o timeout, requeuing

Inside a worker node, I see the error below: http: proxy error: write tcp 172.17.0.2:443 -> X.X.X.X:41234: write: connection reset by peer

After a few minutes, it connects again. In some cases, the connection was re-established by restarting the cattle agent container or by deleting it (it was rebuilt automatically).
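For reference, a minimal sketch of that restart/delete workaround, assuming the default agent workload names and labels Rancher uses in downstream clusters (cattle-node-agent DaemonSet and cattle-cluster-agent Deployment in the cattle-system namespace); adjust names and labels to your setup:

# Delete the node agent pods; the DaemonSet recreates them automatically (assumed label: app=cattle-node-agent)
kubectl -n cattle-system delete pod -l app=cattle-node-agent
# Or restart the cluster agent Deployment instead
kubectl -n cattle-system rollout restart deployment cattle-cluster-agent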

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.5.1
  • Installation option (single install/HA): single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): VMware hosted

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VMs. Master nodes: 2 cores, 8 GB RAM; worker nodes: 8 cores, 32 GB RAM

  • Kubernetes version (use kubectl version):

  • Docker version (use docker version): Docker version 17.03.2-ce, build f5ec1e2

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 31
  • Comments: 60

Most upvoted comments

I've upgraded to 2.5.11 today; I'll keep you posted with the findings.

Same with 2.5.8. Fixed by deleting cattle-node-agent.

Hello everyone! I got some news from Rancher: they told me that they have now identified an issue with the same behaviour at another customer, and a fix for that issue is scheduled for v2.5.12, which is currently targeted for a mid-January release. I've asked whether there is any idea of when this will be ported to v2.6.x.

I’ll keep you posted as soon as I have news

Had the same issue today, randomly, with Rancher v2.5.9 and RKE v1.19.14-rancher1-1, VMware vSphere hosted. We did not reboot the etcd/control plane nodes (HA x3); it fixed itself after ~10 minutes.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.14", GitCommit:"0fd2b5afdfe3134d6e1531365fdb37dd11f54d1c", GitTreeState:"clean", BuildDate:"2021-08-11T18:07:41Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.14", GitCommit:"0fd2b5afdfe3134d6e1531365fdb37dd11f54d1c", GitTreeState:"clean", BuildDate:"2021-08-11T18:02:17Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Master Nodes: 3x 4 vCores, 8 GiB RAM
Worker Nodes: 10x 8 vCores, 64 GiB RAM
Docker Version: Docker Engine 19.03.14 on Oracle Linux 8.3

Same problem using 2.5.8 and the Hetzner cloud provider. Any ideas?

@admejia2 This is happening to me on Rancher v2.7.5, but only on a Kubernetes cluster I created recently. Your solution of deleting the token worked for me, with slightly modified commands because the service account is apparently named cattle rather than rancher on my system:

CATTLE_TOKEN=$(kubectl -n cattle-system get secret -o json | jq -r '.items[].metadata | select(.annotations."kubernetes.io/service-account.name" == "cattle") | .name')
kubectl -n cattle-system delete secret $CATTLE_TOKEN
kubectl rollout restart deployment cattle-cluster-agent -n cattle-system
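If it helps, a hedged way to confirm the agent comes back cleanly after that rollout (assuming the default cattle-cluster-agent Deployment name):

# Wait for the restarted Deployment to become ready
kubectl -n cattle-system rollout status deployment cattle-cluster-agent
# Then tail the new agent's logs and watch it reconnect to the Rancher server
kubectl -n cattle-system logs deployment/cattle-cluster-agent -f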

This should be addressed by the same fix as https://github.com/rancher/rancher/issues/34819 with changes getting into 2.5.12 and 2.6.3-patch1.

Please note that the fix requires the new code to be running in both the Rancher server and the downstream cluster agents. After the Rancher server upgrade, the cluster agents will keep running the old code for a period of time until they are upgraded. If you experience these errors during the upgrade or shortly afterwards, that is OK. However, if you still see these errors after the initial stabilization, please let us know.
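For what it's worth, a hedged way to check which agent images a downstream cluster is actually running after the upgrade (assuming the default workload names in the cattle-system namespace):

# Image used by the cluster agent Deployment
kubectl -n cattle-system get deployment cattle-cluster-agent -o jsonpath='{.spec.template.spec.containers[0].image}'
# Image used by the node agent DaemonSet
kubectl -n cattle-system get daemonset cattle-node-agent -o jsonpath='{.spec.template.spec.containers[0].image}'

Both should report an agent tag matching the upgraded Rancher server version once the agents have rolled over.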


It's mid-January and v2.5.12 hasn't been released yet. Do you have any news about the new Rancher release?

Hello! Some quick updates about this:

Rancher just communicated that the fix for this issue is still scheduled for the next Rancher v2.6 patch release (currently v2.6.4, scheduled for a March release). The issue itself was tracked and fixed in https://github.com/rancher/remotedialer/pull/36.

Same with Rancher v2.5.9

I hit the same issue with v2.4.5. How can I increase the timeout between Rancher and the Kubernetes API server? Mine is 30s. I think it's related to the API server's response time, since sometimes it works and sometimes the error happens.
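As a rough check of whether the API server itself is responding slowly, one hedged approach is to time the same request the health check makes, run from the Rancher host against the downstream cluster (the kubeconfig path below is a placeholder):

# Time the health-check request against the downstream cluster's API server
time kubectl --kubeconfig /path/to/downstream-kubeconfig.yaml get --raw /api/v1/namespaces/kube-system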

I had the same issue and was able to fix it. First, check whether the nodes have sufficient resources (memory/CPU); after increasing RAM to at least 8 GB, I found the real issue: the Rancher service account token had expired.

kubectl -n cattle-system get secret -o json | jq -r '.items[].metadata | select(.annotations."kubernetes.io/service-account.name" == "rancher") | .name'
kubectl -n cattle-system delete secret rancher-token-xxxxxxx
kubectl rollout restart deployment rancher -n cattle-system

I’m still seeing the behavior, albeit not nearly as often, in v2.5.12

Just wait a few minutes.

Hello people!

As mentioned above, I got confirmation from Rancher that 2.5.11 should have the issue fixed. I'll try to upgrade in the next few days and see if the problem is solved. Please share your findings after the upgrade.

@timmy59100 Nope, 2.5.10 doesn't have anything related to this issue, unfortunately.

@AlessioCasco So far, everyone says that their connections get re-established much more quickly, so things no longer get stuck for minutes. We hope it stays that way. If experiments are not prohibited in your environment, moving to the nginx ingress controller might be a good workaround to try out. Beyond that, let's hope the upcoming bugfix on the Rancher side fully gets rid of the connectivity issues.

+1 with 2.5.5, also VMware hosted

+1, same problem with v2.5.5 deployed on AWS