rancher: ETCD high traffic, RAM + timeouts after migration to 2.5.2
What kind of request is this (question/bug/enhancement/feature request): bug
Steps to reproduce (least amount of steps as possible): Migrated Rancher Docker install from 2.4.6 to 2.5.2.
Result: etcd generates a lot more traffic, needs more RAM, and produces timeouts.
Other details that may be helpful: This is an issue on both clusters managed by this Rancher instance.
In the etcd log:
etcdserver: read-only range request.... took too long (130.571166ms) to execute
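To confirm whether etcd itself is the bottleneck, the members can be queried directly from one of the etcd nodes. A minimal sketch, assuming an RKE-provisioned cluster where etcd runs as a Docker container named etcd with the ETCDCTL_* environment variables already set inside it (adjust the container name and TLS flags if your setup differs):
docker exec etcd etcdctl endpoint status --write-out=table   # per-member DB size, leader and raft index
docker exec etcd etcdctl alarm list                          # any NOSPACE or corruption alarms raised
docker exec etcd etcdctl check perf                          # rough throughput/latency benchmark against the cluster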
Environment information
- Rancher version (rancher/rancher / rancher/server image tag or shown bottom left in the UI): 2.5.2
- Installation option (single install/HA): single
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
- Machine type (cloud/VM/metal) and specifications (CPU/memory): cloud, 3x etcd/controlplane nodes: 4 cores 8GB RAM, 5x worker nodes: 4 cores 16GB RAM
- Kubernetes version (use kubectl version): 1.18.3
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
- Docker version (use docker version):
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b
 Built:             Wed Mar 11 01:27:04 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b
  Built:            Wed Mar 11 01:25:42 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
gz#14921 gz#15264 gz#15266 gz#15712
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 26 (7 by maintainers)
Same here after upgrading from 2.4.11 to 2.5.5. The master node with less than 8GB RAM has hit OOM many times. Increasing RAM to 12GB fixed the OOM problems. Traffic on the master node’s VLAN also went up.
After downgrading to 2.4.6, all metrics are normal again…
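For anyone wanting to confirm where the memory and traffic are going before resizing a node, a per-container snapshot on the affected master is enough. A minimal sketch using plain Docker commands, nothing Rancher-specific:
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.NetIO}}"   # per-container RAM and network I/O snapshot
free -m                                                                          # overall memory pressure on the node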
Is there any proper fix for this issue? We are still facing this even with Rancher 2.5.8. It happens on one node after another, in an endless cycle.
@adampl That isn’t really an option for multiple reasons:
Rancher version:
v2.5-head (02/24/2021) 87006de
Normal network activity is observed in this 2.5-head build and memory/CPU usage isn’t noticeably spiking as observed previously in 2.5.5.
@timmy59100 Did you roll back using an etcd snapshot?
For us, trying to revert the Rancher version through Helm 3 without restoring an etcd snapshot didn’t significantly change the RAM used, but the RAM suddenly dropped around the time we cleaned up old Helm 2 releases with the helm-2to3 plugin via helm 2to3 cleanup. I don’t see any direct link between these two things, but maybe it makes sense with some Rancher internals?
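For reference, the cleanup flow we used was the standard helm-2to3 one; a sketch as documented for the plugin, with the dry run showing what Helm 2 release data would be removed before anything is deleted:
helm plugin install https://github.com/helm/helm-2to3
helm 2to3 cleanup --dry-run   # preview which Helm 2 releases/config would be removed
helm 2to3 cleanup             # remove the old Helm 2 release data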