rancher: etcd high traffic, increased RAM usage and timeouts after migration to 2.5.2

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible): Migrated a Rancher Docker install from 2.4.6 to 2.5.2.

Result: etcd generates much more traffic, needs more RAM, and produces timeouts.

Other details that may be helpful: This occurs on both clusters managed by this Rancher server.

In the etcd log:

etcdserver: read-only range request.... took too long (130.571166ms) to execute

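For anyone comparing the behavior before and after the upgrade, a minimal way to quantify these warnings (assuming an RKE custom cluster where the etcd container uses the default name etcd) is to count them in the container logs on each etcd node:

    # Count "took too long" warnings from etcd over the last hour
    # ("etcd" is the default container name on RKE custom clusters; adjust if yours differs).
    docker logs --since 1h etcd 2>&1 | grep -c "took too long"

    # Show the most recent slow requests with their durations for comparison.
    docker logs --since 1h etcd 2>&1 | grep "took too long" | tail -n 20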

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag or shown bottom left in the UI): 2.5.2
  • Installation option (single install/HA): single

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud; 3x etcd/controlplane nodes with 4 cores and 8 GB RAM; 5x worker nodes with 4 cores and 16 GB RAM
  • Kubernetes version (use kubectl version): 1.18.3
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

gz#14921 gz#15264 gz#15266 gz#15712

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 26 (7 by maintainers)

Most upvoted comments

Same here after upgrading from 2.4.11 to 2.5.5. The master node with less than 8 GB RAM hit OOM many times; increasing RAM to 12 GB fixed the OOM problems. Traffic on the master node’s VLAN also went up (see attached screenshot).

After downgrading to 2.4.6, all metrics are normal again…

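If others want to confirm that these are kernel OOM kills rather than container restarts for some other reason, a quick generic check on the affected node (not Rancher-specific) is:

    # List recent out-of-memory events recorded by the kernel, with readable timestamps.
    dmesg -T | grep -i "out of memory"

    # On systemd hosts, the same information via the kernel journal.
    journalctl -k | grep -i "oom-killer"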

Is there any proper fix for this issue? We are still facing this even with Rancher 2.5.8. It happens on one node after another, in an endless cycle.

@adampl That isn’t really an option, for multiple reasons:

  1. Rancher is not in charge of etcd; the provisioner is.
  2. We don’t know how to talk to etcd, and in a lot of cases we can’t. A good example is a hosted cluster in AKS or GKE.
  3. That would be unexpected behavior for a Rancher upgrade. We don’t cause changes to etcd unless it’s explicit, e.g. you as a user change the Kubernetes version of your cluster, which would (possibly) update the version of etcd (a quick way to check the running etcd version is sketched after this list).
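To expand on point 3, one way to check whether an upgrade actually changed the etcd version on an RKE custom cluster is to inspect the etcd container on an etcd node; a sketch assuming the default container name etcd:

    # Print the image (and therefore the etcd version tag) the etcd container is running.
    docker inspect etcd --format '{{.Config.Image}}'

    # Or ask the bundled etcdctl inside the container (etcd 3.4+ defaults to the v3 API).
    docker exec etcd etcdctl version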

Rancher version: v2.5-head (02/24/2021) 87006de

Normal network activity is observed in this 2.5-head build, and memory/CPU usage isn’t noticeably spiking as observed previously in 2.5.5.


@timmy59100 Did you roll back using an etcd snapshot?

For us, trying to revert the Rancher version through Helm 3 without restoring an etcd snapshot didn’t significantly change the RAM used, but the RAM suddenly dropped around the time we cleaned up old Helm 2 releases with the helm-2to3 plugin, via helm 2to3 cleanup.

I don’t see any direct link between these two things, but maybe it makes sense with some Rancher internals?
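For context, the cleanup referred to above is done with the helm-2to3 plugin; a rough sketch of that workflow (flag names may differ between plugin versions, so a dry run first is advisable):

    # Install the plugin if it isn't present yet.
    helm plugin install https://github.com/helm/helm-2to3

    # Preview what Helm v2 data (releases, config, Tiller) would be removed.
    helm 2to3 cleanup --dry-run

    # Remove the leftover Helm v2 configuration and release data.
    helm 2to3 cleanup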