rancher: Rancher 2 Master gets OOM killed

**What kind of request is this (question/bug/enhancement/feature request):** Bug

**Steps to reproduce (least amount of steps as possible):** The Rancher master's memory usage grows by about 1.75 GB per hour until the process is killed by the kernel OOM killer. Only the master is affected; the other Rancher pods show normal memory usage. We already upgraded to 16 GB of RAM, but the issue persists.

We suspect it is related to cluster-monitoring, because we found cluster-monitoring data in memory dumps taken with GDB. Is the management layer caching cluster-monitoring data?

/proc/PID/smaps:
+c084000000-c08c000000 rw-p 00000000 00:00 0 
+Size:             131072 kB
+Rss:              131072 kB
+Pss:              131072 kB
+Shared_Clean:          0 kB
+Shared_Dirty:          0 kB
+Private_Clean:         0 kB
+Private_Dirty:    131072 kB
+Referenced:       131072 kB
+Anonymous:        131072 kB
+AnonHugePages:         0 kB
+Shared_Hugetlb:        0 kB
+Private_Hugetlb:       0 kB
+Swap:                  0 kB
+SwapPss:               0 kB
+KernelPageSize:        4 kB
+MMUPageSize:           4 kB
+Locked:                0 kB
+VmFlags: rd wr mr mw me ac sd 
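To judge how much of the process consists of regions like the one above, the per-region fields in smaps can be summed across the whole file. The following is a minimal Go sketch of that idea, not a Rancher tool; the function name `sumFieldKB` and the embedded sample text are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumFieldKB adds up every "<field>: <n> kB" line found in smaps-formatted
// text. The name and signature are hypothetical, chosen for this sketch.
func sumFieldKB(smaps, field string) (int64, error) {
	var total int64
	prefix := field + ":"
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		// The remainder looks like "  131072 kB"; the first token is the value.
		parts := strings.Fields(strings.TrimPrefix(line, prefix))
		if len(parts) == 0 {
			continue
		}
		n, err := strconv.ParseInt(parts[0], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", line, err)
		}
		total += n
	}
	return total, sc.Err()
}

func main() {
	// Sample taken from the region shown above; on a live system the text
	// would come from reading /proc/<PID>/smaps instead.
	sample := "Size:             131072 kB\n" +
		"Rss:              131072 kB\n" +
		"Private_Dirty:    131072 kB\n"
	total, err := sumFieldKB(sample, "Private_Dirty")
	if err != nil {
		panic(err)
	}
	fmt.Printf("Private_Dirty total: %d kB\n", total)
}
```

Comparing the `Private_Dirty` total with the `Rss` total gives a rough idea of how much resident memory is private to the process, which is what the kernel reclaims when the OOM killer fires.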

{"metadata":{"name":"cluster-monitoring.v341","namespace":"cattle-prometheus","selfLink":"/api/v1/namespaces/cattle-prometheus/configmaps/cluster-monitoring.v341","uid":"de760c13-7bb7-11e9-b87e-005056832c45","resourceVersion":"884046","creationTimestamp":"2019-05-21T11:02:05Z","labels":{"MODIFIED_AT":"1558437104","NAME":"cluster-monitoring","OWNER":"TILLER","STATUS":"SUPERSEDED","VERSION":"340"}},"data":{"release":"...

Are there any suggestions on how to find the subcomponent in Rancher that is allocating the memory?

Environment information

  • Rancher version: v2.2.4
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type: Imported rke cluster

  • Machine specifications (CPU/memory): 6 vCPUs, 16 GB RAM

  • Docker version:

# docker version
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:17 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:23:02 2019
  OS/Arch:          linux/amd64
  Experimental:     false

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

I hit this issue on v2.3.2 and investigated the root cause in depth. After identifying it, I checked the v2.4.2 code to see whether the problem still existed. I found the patch https://github.com/rancher/norman/commit/6269ccdbeace958aa76ec92f7d4c42c442459bb1, which completely solves what we encountered, but I also found another cause of goroutine leaks in the latest (v2.4.2) Rancher.

Let me explain the goroutine leak case I faced and investigated in each version (v2.3.2 and v2.4.2).

v2.3.2

This was the scenario that caused the goroutine leak in our case.

In this case, the patch https://github.com/rancher/norman/commit/6269ccdbeace958aa76ec92f7d4c42c442459bb1 solves the problem, because the workqueue object is now created when Sync() is evaluated, which is part of Start(), instead of when the GenericController object is created.


v2.4.2

I thought the issue above was completely gone, but I hit a memory leak again under the same conditions in rancher/rancher:v2.4.2, so I investigated again. I found that ClusterManager currently uses https://github.com/rancher/wrangler for the RBAC cache, and this causes a memory leak.

Thanks for the awesome investigation and write up @ukinau! We’re actively investigating this and this will save us a lot of time.

Install Rancher:v2.4.2 in a 3-node RKE cluster, add a couple of clusters to the setup, and “break” some of them. Here is the CPU and memory consumption of the rancher workload in the local cluster:

[Screenshot taken 2020-04-21 at 11:02:23 AM]

Another identical setup, using the image Rancher:v2.4-2774-head (cf5ab1d):

[Screenshot taken 2020-04-21 at 11:03:19 AM]

Here is another identical setup, using Rancher:master-2792-head:

[Screenshot taken 2020-04-22 at 11:02:13 AM]

Ugh. This issue is so annoying. I can fix this tomorrow.