rancher: Cluster goes to unavailable because failed to communicate with API server: i/o timeout

What kind of request is this (question/bug/enhancement/feature request): Question/Bug

Steps to reproduce (least amount of steps as possible): It happens randomly.

Result: I randomly get the message below from Rancher and lose connectivity with the k8s cluster: Cluster health check failed: Failed to communicate with API server: Get "https://X.X.X.X:6443/api/v1/namespaces/kube-system?timeout=45s": write tcp X.X.X.X:443->X.X.X.X:58260: i/o timeout

Other details that may be helpful: Inside the Rancher node, I see the error below: 2021/01/22 20:54:33 [ERROR] error syncing 'u-isxe45ee6z': handler cat-user-attribute-controller: Put "https://X.X.X.X:6443/apis/cluster.cattle.io/v3/namespaces/cattle-system/clusteruserattributes/u-isxe45ee6z": write tcp X.X.X.X:443->X.X.X.X:44508: i/o timeout, requeuing

Inside a worker node, I see the error below: http: proxy error: write tcp 172.17.0.2:443 -> X.X.X.X:41234: write: connection reset by peer

After a few minutes, it connects again. In some cases, the connection was re-established by restarting the cattle agent container or by deleting it (it was rebuilt automatically).
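For reference, a minimal sketch of that restart/delete workaround, assuming the default agent workload names and labels Rancher uses in downstream clusters (cattle-node-agent DaemonSet and cattle-cluster-agent Deployment in the cattle-system namespace); adjust names and labels to your setup:

# Delete the node agent pods; the DaemonSet recreates them automatically (assumed label: app=cattle-node-agent)
kubectl -n cattle-system delete pod -l app=cattle-node-agent
# Or restart the cluster agent Deployment instead
kubectl -n cattle-system rollout restart deployment cattle-cluster-agent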

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.5.1
  • Installation option (single install/HA): single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): VMware hosted

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VMs. Master nodes: 2 cores, 8 GB RAM; worker nodes: 8 cores, 32 GB RAM

  • Kubernetes version (use kubectl version):

  • Docker version (use docker version): Docker version 17.03.2-ce, build f5ec1e2

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 31
  • Comments: 60

Most upvoted comments

I've upgraded to 2.5.11 today; I'll keep you posted with the findings.

Same with 2.5.8. Fixed by deleting cattle-node-agent.

Hello everyone! I got some news from Rancher: they told me that they have now identified an issue with the same behaviour at another customer, and a fix for that issue is scheduled for v2.5.12, which is currently targeted for a mid-January release. I've asked whether there is any idea of when this will be ported to v2.6.x.

I’ll keep you posted as soon as I have news

Had the same issue today, randomly, with Rancher v2.5.9 and RKE v1.19.14-rancher1-1, VMware vSphere hosted. We did not reboot the etcd/control plane nodes (HA x3); it fixed itself after ~10 minutes.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.14", GitCommit:"0fd2b5afdfe3134d6e1531365fdb37dd11f54d1c", GitTreeState:"clean", BuildDate:"2021-08-11T18:07:41Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.14", GitCommit:"0fd2b5afdfe3134d6e1531365fdb37dd11f54d1c", GitTreeState:"clean", BuildDate:"2021-08-11T18:02:17Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Master Nodes: 3x 4 vCores, 8 GiB RAM
Worker Nodes: 10x 8 vCores, 64 GiB RAM
Docker Version: Docker Engine 19.03.14 on Oracle Linux 8.3

Same problem using 2.5.8 and the Hetzner cloud provider. Any ideas?

@admejia2 This is happening to me on Rancher v2.7.5, but only on a Kubernetes cluster I created recently. Your solution of deleting the token worked for me, with slightly modified commands because the service account is apparently named cattle rather than rancher on my system:

CATTLE_TOKEN=$(kubectl -n cattle-system get secret -o json | jq -r '.items[].metadata | select(.annotations."kubernetes.io/service-account.name" == "cattle") | .name')
kubectl -n cattle-system delete secret $CATTLE_TOKEN
kubectl rollout restart deployment cattle-cluster-agent -n cattle-system
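If it helps, a hedged way to confirm the agent comes back cleanly after that rollout (assuming the default cattle-cluster-agent Deployment name):

# Wait for the restarted Deployment to become ready
kubectl -n cattle-system rollout status deployment cattle-cluster-agent
# Then tail the new agent's logs and watch it reconnect to the Rancher server
kubectl -n cattle-system logs deployment/cattle-cluster-agent -f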

This should be addressed by the same fix as https://github.com/rancher/rancher/issues/34819 with changes getting into 2.5.12 and 2.6.3-patch1.

Please note that the fix requires the new code to be running in both the Rancher server and the downstream cluster agents. After the Rancher server upgrade, the cluster agents will keep running the old code for a period of time until they are upgraded. If you experience these errors during the upgrade or shortly afterwards, that is OK. However, if you still see these errors after the initial stabilization, please let us know.
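For what it's worth, a hedged way to check which agent images a downstream cluster is actually running after the upgrade (assuming the default workload names in the cattle-system namespace):

# Image used by the cluster agent Deployment
kubectl -n cattle-system get deployment cattle-cluster-agent -o jsonpath='{.spec.template.spec.containers[0].image}'
# Image used by the node agent DaemonSet
kubectl -n cattle-system get daemonset cattle-node-agent -o jsonpath='{.spec.template.spec.containers[0].image}'

Both should report an agent tag matching the upgraded Rancher server version once the agents have rolled over.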


It's mid-January and v2.5.12 hasn't been released yet. Do you have any news about the new Rancher release?

Hello! Some quick updates about this:

Rancher just communicated that the fix for this issue is still scheduled for the next Rancher v2.6 patch release (currently v2.6.4, scheduled for a March release). The issue itself was tracked and fixed in https://github.com/rancher/remotedialer/pull/36.

Same with Rancher v2.5.9

I hit the same issue with v2.4.5. How can I increase the timeout between Rancher and the Kubernetes API server? Mine is 30s. I think it's related to the API server's response time, since sometimes it works and sometimes the error happens.
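As a rough check of whether the API server itself is responding slowly, one hedged approach is to time the same request the health check makes, run from the Rancher host against the downstream cluster (the kubeconfig path below is a placeholder):

# Time the health-check request against the downstream cluster's API server
time kubectl --kubeconfig /path/to/downstream-kubeconfig.yaml get --raw /api/v1/namespaces/kube-system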

I had the same issue and was able to fix it. First, check whether the nodes have sufficient resources (memory/CPU); after increasing RAM to at least 8 GB, I found the real issue: the Rancher service account token had expired.

kubectl -n cattle-system get secret -o json | jq -r '.items[].metadata | select(.annotations."kubernetes.io/service-account.name" == "rancher") | .name'
kubectl -n cattle-system delete secret rancher-token-xxxxxxx
kubectl rollout restart deployment rancher -n cattle-system

I’m still seeing the behavior, albeit not nearly as often, in v2.5.12

Just wait a few minutes.

Hello people!

As mentioned above, I got confirmation from Rancher that 2.5.11 should have the issue fixed. I'll try to upgrade in the next few days and see if the problem is solved. Please share your findings after the upgrade.

@timmy59100 Nope, 2.5.10 doesn't have anything related to this issue, unfortunately.

@AlessioCasco So far, everyone says that their connections get re-established much more quickly, so things no longer get stuck for minutes. We hope it stays that way. If experiments are not prohibited in your environment, moving to the nginx ingress controller might be a good workaround to try out. Beyond that, let's hope the upcoming bugfix on the Rancher side fully gets rid of the connectivity issues.

+1 with 2.5.5, also VMware hosted

+1, same problem with v2.5.5 deployed on AWS