rancher: Rancher 2 Master gets OOM killed
**What kind of request is this (question/bug/enhancement/feature request):** Bug
**Steps to reproduce (least amount of steps as possible):** The Rancher master allocates roughly 1.75 GB of additional RAM per hour and eventually gets killed by the kernel OOM killer. Only the master is affected; the other Rancher pods have normal memory usage. We already upgraded the node to 16 GB of RAM, but the issue persists.
We suspect this could be related to cluster-monitoring, because we found its data in memory dumps taken with GDB. Is the management layer caching cluster-monitoring data?
/proc/PID/smaps:
c084000000-c08c000000 rw-p 00000000 00:00 0
Size: 131072 kB
Rss: 131072 kB
Pss: 131072 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 131072 kB
Referenced: 131072 kB
Anonymous: 131072 kB
AnonHugePages: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd
{"metadata":{"name":"cluster-monitoring.v341","namespace":"cattle-prometheus","selfLink":"/api/v1/namespaces/cattle-prometheus/configmaps/cluster-monitoring.v341","uid":"de760c13-7bb7-11e9-b87e-005056832c45","resourceVersion":"884046","creationTimestamp":"2019-05-21T11:02:05Z","labels":{"MODIFIED_AT":"1558437104","NAME":"cluster-monitoring","OWNER":"TILLER","STATUS":"SUPERSEDED","VERSION":"340"}},"data":{"release":"...
Are there any suggestions on how to find the subcomponent in Rancher that is allocating the memory?
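(Side note on the question above: since Rancher is a Go program, one general way to attribute memory to a subcomponent is a Go heap profile. The sketch below only illustrates the standard net/http/pprof mechanism; the listen address is an example, and whether a given Rancher build actually exposes pprof endpoints is an assumption, not something confirmed here.)

```go
// Generic illustration of exposing Go heap profiles via net/http/pprof.
// This is NOT Rancher-specific configuration.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Example address only; bind to localhost so the profiler is not public.
	// A heap profile can then be inspected with:
	//   go tool pprof http://127.0.0.1:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

With such a profile, `go tool pprof` commands like `top` and `tree` show which packages retain the live allocations, which is usually faster than reconstructing objects from GDB memory dumps.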
**Environment information**
- Rancher version: v2.2.4
- Installation option (single install/HA): HA
**Cluster information**
- Cluster type: Imported RKE cluster
- Machine specifications (CPU/memory): 6 vCPUs, 16 GB RAM
- Docker version:
# docker version
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:17 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.7
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:23:02 2019
OS/Arch: linux/amd64
Experimental: false
I’m hitting this issue with v2.3.2. I investigated the root cause in depth, and after identifying it I checked the v2.4.2 code to see whether the problem still exists. There I found the patch https://github.com/rancher/norman/commit/6269ccdbeace958aa76ec92f7d4c42c442459bb1, which completely solves what we encountered, but I found another cause of goroutine leaks in the latest Rancher (v2.4.2).
Let me explain the goroutine leak I investigated in each version (v2.3.2 and v2.4.2).
v2.3.2
This was the scenario that caused the goroutine leak in our case.
For this case, the patch https://github.com/rancher/norman/commit/6269ccdbeace958aa76ec92f7d4c42c442459bb1 solves the problem, because the workqueue object is now created when Sync() is evaluated (as part of Start()) rather than when the GenericController object is constructed.
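To make that concrete, here is a minimal sketch of the lazy-initialization pattern in Go. The names (genericController, workQueue, Start) are hypothetical stand-ins, not the real norman types; see the linked commit for the actual change.

```go
// Sketch of the pattern behind the patch described above: allocate the
// workqueue when the controller is started, not when it is constructed.
// All identifiers here are illustrative and do not match the norman code.
package main

import "sync"

// workQueue stands in for the real queue type; in practice it owns buffers
// and worker goroutines, so one queue per never-started controller adds up.
type workQueue struct{}

func newWorkQueue() *workQueue { return &workQueue{} }

type genericController struct {
	startOnce sync.Once
	queue     *workQueue
}

// Pre-patch shape (conceptually): the queue is allocated at construction
// time, so every controller object pays for a queue even if never started.
func newGenericControllerEager() *genericController {
	return &genericController{queue: newWorkQueue()}
}

// Post-patch shape (conceptually): construction is cheap ...
func newGenericController() *genericController {
	return &genericController{}
}

// ... and the queue is created only on the first Start, mirroring the
// description above of creating it when Sync()/Start() runs.
func (c *genericController) Start() {
	c.startOnce.Do(func() {
		c.queue = newWorkQueue()
	})
}

func main() {
	c := newGenericController() // no queue allocated yet
	c.Start()                   // queue allocated here, exactly once
}
```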
v2.4.2
I thought the above issue was completely gone, but I hit the memory leak again under the same conditions with rancher/rancher:v2.4.2, so I investigated again. I found that ClusterManager now uses https://github.com/rancher/wrangler for the RBAC cache, and this causes a memory leak as follows.
Condition
How the leak happens (this is very similar to what I explained above)
Key point of the above procedure
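The exact v2.4.2 details live in ClusterManager and wrangler, but the general class of leak being described can be illustrated with a small, self-contained Go sketch. This is not the actual Rancher or wrangler code; rbacCache and the rebuild loops below are hypothetical stand-ins for a cache goroutine that outlives the cluster context it was started for.

```go
// Generic illustration of a goroutine leak: a per-cluster cache starts a
// background goroutine tied to a context, and if that context is never
// cancelled when the cluster context is rebuilt, every rebuild leaks one
// goroutine plus whatever the cache retains. NOT actual Rancher/wrangler code.
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

type rbacCache struct{}

// Start launches a refresh loop that exits only when ctx is cancelled.
func (c *rbacCache) Start(ctx context.Context) {
	go func() {
		ticker := time.NewTicker(10 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// refresh cached RBAC objects (omitted)
			}
		}
	}()
}

func main() {
	// Leaky pattern: each "rebuild" starts a new cache without cancelling the
	// previous one, so the old goroutines never exit.
	for i := 0; i < 100; i++ {
		(&rbacCache{}).Start(context.Background())
	}
	time.Sleep(50 * time.Millisecond)
	fmt.Println("goroutines after leaky rebuilds:", runtime.NumGoroutine())

	// Fixed pattern: cancel the previous context before starting the next
	// cache, so each rebuild stops the goroutine it replaces.
	cancel := context.CancelFunc(func() {})
	for i := 0; i < 100; i++ {
		cancel()
		ctx, next := context.WithCancel(context.Background())
		cancel = next
		(&rbacCache{}).Start(ctx)
	}
	cancel()
	time.Sleep(50 * time.Millisecond)
	// This count is roughly the same as the previous one: the fixed loop did
	// not add another 100 goroutines on top of the ones already leaked above.
	fmt.Println("goroutines after fixed rebuilds:", runtime.NumGoroutine())
}
```

A steadily growing goroutine count like the one the leaky loop produces is the same kind of signature one would look for in a goroutine dump of the rancher pod.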
Thanks for the awesome investigation and write-up @ukinau! We’re actively investigating this, and it will save us a lot of time.
Install Rancher:v2.4.2 in an RKE 3-node cluster, add a couple of clusters to the setup, and “break” some of them. Here is the CPU and memory consumption of the rancher workload in the local cluster (graph omitted). Make another identical setup using the image Rancher:v2.4-2774-head (cf5ab1d); here is its CPU and memory consumption (graph omitted). And here is another identical setup using Rancher:master-2792-head (graph omitted).
Ugh. This issue is so annoying. I can fix this tomorrow.