rancher: Rancher 2 Master gets OOM killed
**What kind of request is this (question/bug/enhancement/feature request):** Bug
**Steps to reproduce (least amount of steps as possible):** The Rancher master allocates roughly 1.75 GB of additional RAM per hour and eventually gets killed by the kernel OOM killer. Only the master is affected; the other Rancher pods have normal memory usage. We already upgraded the node to 16 GB of RAM, but the issue persists.
We suspect this could be related to cluster-monitoring, because we found its data in memory dumps taken with GDB. Is the management layer caching cluster-monitoring data?
/proc/PID/smaps:
c084000000-c08c000000 rw-p 00000000 00:00 0
Size: 131072 kB
Rss: 131072 kB
Pss: 131072 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 131072 kB
Referenced: 131072 kB
Anonymous: 131072 kB
AnonHugePages: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd
{"metadata":{"name":"cluster-monitoring.v341","namespace":"cattle-prometheus","selfLink":"/api/v1/namespaces/cattle-prometheus/configmaps/cluster-monitoring.v341","uid":"de760c13-7bb7-11e9-b87e-005056832c45","resourceVersion":"884046","creationTimestamp":"2019-05-21T11:02:05Z","labels":{"MODIFIED_AT":"1558437104","NAME":"cluster-monitoring","OWNER":"TILLER","STATUS":"SUPERSEDED","VERSION":"340"}},"data":{"release":"...
Are there any suggestions on how to find the subcomponent in Rancher that is allocating the memory?
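(Side note on the question above: since Rancher is a Go program, one general way to attribute memory to a subcomponent is a Go heap profile. The sketch below only illustrates the standard net/http/pprof mechanism; the listen address is an example, and whether a given Rancher build actually exposes pprof endpoints is an assumption, not something confirmed here.)

```go
// Generic illustration of exposing Go heap profiles via net/http/pprof.
// This is NOT Rancher-specific configuration.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Example address only; bind to localhost so the profiler is not public.
	// A heap profile can then be inspected with:
	//   go tool pprof http://127.0.0.1:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

With such a profile, `go tool pprof` commands like `top` and `tree` show which packages retain the live allocations, which is usually faster than reconstructing objects from GDB memory dumps.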
**Environment information**
- Rancher version: v2.2.4
- Installation option (single install/HA): HA
**Cluster information**
- Cluster type: Imported RKE cluster
- Machine specifications (CPU/memory): 6 vCPUs, 16 GB RAM
- Docker version:
# docker version
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:17 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.7
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:23:02 2019
OS/Arch: linux/amd64
Experimental: false
I’m hitting this issue with v2.3.2. I investigated the root cause in depth, and after identifying it I checked the v2.4.2 code to see whether the problem still exists. There I found the patch https://github.com/rancher/norman/commit/6269ccdbeace958aa76ec92f7d4c42c442459bb1, which completely solves what we encountered, but I found another cause of goroutine leaks in the latest Rancher (v2.4.2).
Let me explain the goroutine leak I investigated in each version (v2.3.2 and v2.4.2).
v2.3.2
This was the scenario that caused the goroutine leak in our case.
For this case, the patch https://github.com/rancher/norman/commit/6269ccdbeace958aa76ec92f7d4c42c442459bb1 solves the problem, because the workqueue object is now created when Sync() is evaluated (as part of Start()) rather than when the GenericController object is constructed.
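To make that concrete, here is a minimal sketch of the lazy-initialization pattern in Go. The names (genericController, workQueue, Start) are hypothetical stand-ins, not the real norman types; see the linked commit for the actual change.

```go
// Sketch of the pattern behind the patch described above: allocate the
// workqueue when the controller is started, not when it is constructed.
// All identifiers here are illustrative and do not match the norman code.
package main

import "sync"

// workQueue stands in for the real queue type; in practice it owns buffers
// and worker goroutines, so one queue per never-started controller adds up.
type workQueue struct{}

func newWorkQueue() *workQueue { return &workQueue{} }

type genericController struct {
	startOnce sync.Once
	queue     *workQueue
}

// Pre-patch shape (conceptually): the queue is allocated at construction
// time, so every controller object pays for a queue even if never started.
func newGenericControllerEager() *genericController {
	return &genericController{queue: newWorkQueue()}
}

// Post-patch shape (conceptually): construction is cheap ...
func newGenericController() *genericController {
	return &genericController{}
}

// ... and the queue is created only on the first Start, mirroring the
// description above of creating it when Sync()/Start() runs.
func (c *genericController) Start() {
	c.startOnce.Do(func() {
		c.queue = newWorkQueue()
	})
}

func main() {
	c := newGenericController() // no queue allocated yet
	c.Start()                   // queue allocated here, exactly once
}
```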
v2.4.2
I thought the above issue was completely gone, but I hit the memory leak again under the same conditions with rancher/rancher:v2.4.2, so I investigated again. I found that ClusterManager now uses https://github.com/rancher/wrangler for the RBAC cache, and this causes a memory leak as follows.
Condition
How the leak happens (this is very similar to what I explained above)
Key point of the above procedure
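The exact v2.4.2 details live in ClusterManager and wrangler, but the general class of leak being described can be illustrated with a small, self-contained Go sketch. This is not the actual Rancher or wrangler code; rbacCache and the rebuild loops below are hypothetical stand-ins for a cache goroutine that outlives the cluster context it was started for.

```go
// Generic illustration of a goroutine leak: a per-cluster cache starts a
// background goroutine tied to a context, and if that context is never
// cancelled when the cluster context is rebuilt, every rebuild leaks one
// goroutine plus whatever the cache retains. NOT actual Rancher/wrangler code.
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

type rbacCache struct{}

// Start launches a refresh loop that exits only when ctx is cancelled.
func (c *rbacCache) Start(ctx context.Context) {
	go func() {
		ticker := time.NewTicker(10 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// refresh cached RBAC objects (omitted)
			}
		}
	}()
}

func main() {
	// Leaky pattern: each "rebuild" starts a new cache without cancelling the
	// previous one, so the old goroutines never exit.
	for i := 0; i < 100; i++ {
		(&rbacCache{}).Start(context.Background())
	}
	time.Sleep(50 * time.Millisecond)
	fmt.Println("goroutines after leaky rebuilds:", runtime.NumGoroutine())

	// Fixed pattern: cancel the previous context before starting the next
	// cache, so each rebuild stops the goroutine it replaces.
	cancel := context.CancelFunc(func() {})
	for i := 0; i < 100; i++ {
		cancel()
		ctx, next := context.WithCancel(context.Background())
		cancel = next
		(&rbacCache{}).Start(ctx)
	}
	cancel()
	time.Sleep(50 * time.Millisecond)
	// This count is roughly the same as the previous one: the fixed loop did
	// not add another 100 goroutines on top of the ones already leaked above.
	fmt.Println("goroutines after fixed rebuilds:", runtime.NumGoroutine())
}
```

A steadily growing goroutine count like the one the leaky loop produces is the same kind of signature one would look for in a goroutine dump of the rancher pod.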
Thanks for the awesome investigation and write-up @ukinau! We’re actively investigating this, and it will save us a lot of time.
Install Rancher:v2.4.2 in an RKE 3-node cluster, add a couple of clusters to the setup, and “break” some of them. Here is the CPU and memory consumption of the rancher workload in the local cluster (graph omitted). Make another identical setup using the image Rancher:v2.4-2774-head (cf5ab1d); here is its CPU and memory consumption (graph omitted). And here is another identical setup using Rancher:master-2792-head (graph omitted).
Ugh. This issue is so annoying. I can fix this tomorrow.