rancher: etcd high traffic, increased RAM usage and timeouts after migration to 2.5.2

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible): Migrated a Rancher Docker install from 2.4.6 to 2.5.2.

Result: etcd generates much more traffic, needs more RAM, and produces timeouts.

Other details that may be helpful: This occurs on both clusters managed by this Rancher server.

In the etcd log:

etcdserver: read-only range request.... took too long (130.571166ms) to execute

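For anyone comparing the behavior before and after the upgrade, a minimal way to quantify these warnings (assuming an RKE custom cluster where the etcd container uses the default name etcd) is to count them in the container logs on each etcd node:

    # Count "took too long" warnings from etcd over the last hour
    # ("etcd" is the default container name on RKE custom clusters; adjust if yours differs).
    docker logs --since 1h etcd 2>&1 | grep -c "took too long"

    # Show the most recent slow requests with their durations for comparison.
    docker logs --since 1h etcd 2>&1 | grep "took too long" | tail -n 20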

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag or shown bottom left in the UI): 2.5.2
  • Installation option (single install/HA): single

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud; 3x etcd/controlplane nodes with 4 cores and 8 GB RAM; 5x worker nodes with 4 cores and 16 GB RAM
  • Kubernetes version (use kubectl version): 1.18.3
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

gz#14921 gz#15264 gz#15266 gz#15712

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 26 (7 by maintainers)

Most upvoted comments

Same here after upgrading from 2.4.11 to 2.5.5. The master node with less than 8 GB RAM hit OOM many times; increasing RAM to 12 GB fixed the OOM problems. Traffic on the master node’s VLAN also went up (see attached screenshot).

After downgrading to 2.4.6, all metrics are normal again…

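If others want to confirm that these are kernel OOM kills rather than container restarts for some other reason, a quick generic check on the affected node (not Rancher-specific) is:

    # List recent out-of-memory events recorded by the kernel, with readable timestamps.
    dmesg -T | grep -i "out of memory"

    # On systemd hosts, the same information via the kernel journal.
    journalctl -k | grep -i "oom-killer"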

Is there any proper fix for this issue? We are still facing this even with Rancher 2.5.8. It happens on one node after another, in an endless cycle.

@adampl That isn’t really an option, for multiple reasons:

  1. Rancher is not in charge of etcd; the provisioner is.
  2. We don’t know how to talk to etcd, and in a lot of cases we can’t. A good example is a hosted cluster in AKS or GKE.
  3. That would be unexpected behavior for a Rancher upgrade. We don’t cause changes to etcd unless it’s explicit, e.g. you as a user change the Kubernetes version of your cluster, which would (possibly) update the version of etcd (a quick way to check the running etcd version is sketched after this list).
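To expand on point 3, one way to check whether an upgrade actually changed the etcd version on an RKE custom cluster is to inspect the etcd container on an etcd node; a sketch assuming the default container name etcd:

    # Print the image (and therefore the etcd version tag) the etcd container is running.
    docker inspect etcd --format '{{.Config.Image}}'

    # Or ask the bundled etcdctl inside the container (etcd 3.4+ defaults to the v3 API).
    docker exec etcd etcdctl version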

Rancher version: v2.5-head (02/24/2021) 87006de

Normal network activity is observed in this 2.5-head build, and memory/CPU usage isn’t noticeably spiking as observed previously in 2.5.5.


@timmy59100 Did you roll back using an etcd snapshot?

For us, trying to revert the Rancher version through Helm 3 without restoring an etcd snapshot didn’t significantly change the RAM used, but the RAM suddenly dropped around the time we cleaned up old Helm 2 releases with the helm-2to3 plugin, via helm 2to3 cleanup.

I don’t see any direct link between these two things, but maybe it makes sense with some Rancher internals?
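For context, the cleanup referred to above is done with the helm-2to3 plugin; a rough sketch of that workflow (flag names may differ between plugin versions, so a dry run first is advisable):

    # Install the plugin if it isn't present yet.
    helm plugin install https://github.com/helm/helm-2to3

    # Preview what Helm v2 data (releases, config, Tiller) would be removed.
    helm 2to3 cleanup --dry-run

    # Remove the leftover Helm v2 configuration and release data.
    helm 2to3 cleanup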