rancher: rancher 2.0.4 server memory leak

Rancher versions: rancher/rancher: v2.0.4

Kubernetes versions:

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1", GitCommit:"d4ab47518836c750f9949b9e0d387f20fb92260b", GitTreeState:"clean", BuildDate:"2018-04-12T14:14:26Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Docker version: (docker version, docker info preferred)

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=xenial

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) OpenStack VM

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) We set up a 3-node Kubernetes cluster manually via RKE on OpenStack-backed VMs and deployed Rancher 2.0.4 as a single-pod deployment (we have been running Rancher 2.0 since the technical preview).

Recently this system became the web console for our production Kubernetes clusters, and we discovered that the pod of the cattle deployment crashes with OOM (RSS nearly 7 GiB) every few days.

Every time the cattle pod crashes, all of our clusters need to re-establish their connections to it, which is quite annoying.

Is there some progress on this issue, or can we help out with further investigation?
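
As a hedged aside for anyone investigating: the Rancher server is a Go binary, and one generic way to watch a Go process's heap over time is to log runtime.MemStats periodically or to expose the standard net/http/pprof endpoints and pull heap profiles. The sketch below is illustrative only; it is not Rancher's code, and whether the rancher/rancher image exposes such an endpoint is not confirmed here.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// Periodically log heap counters so growth over hours or days shows up in the logs.
	go func() {
		var m runtime.MemStats
		for range time.Tick(30 * time.Second) {
			runtime.ReadMemStats(&m)
			log.Printf("heap_alloc=%dMiB heap_objects=%d num_gc=%d",
				m.HeapAlloc>>20, m.HeapObjects, m.NumGC)
		}
	}()

	// Heap profiles can then be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

Comparing two heap profiles taken a few hours apart is usually enough to see which allocation sites are growing.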

Results: [screenshot: 2018-07-03, 6:19 PM]

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 29 (6 by maintainers)

Most upvoted comments

We’ve addressed 3 or 4 different performance issues around this ticket in v2.0.8. @cloudnautique has verified those through performance testing. As he suggests, I am going to close this issue. New bugs or issues around memory leaks or performance should be opened as new issues.

@cloudnautique of course, I will post tomorrow. In short, it is the problem described by @BowlingX: I noticed memory increasing to 100% on the catalog apps tab and probably other tabs.

Rancher 2.0.6 with Kubernetes v1.10.5-rancher1-1

Maybe a hint in which direction to look (from the Rancher server logs, correlated with the CPU/memory spikes):

E0803 05:51:25.597774       1 streamwatcher.go:109] Unable to decode an event from the watch stream: json: cannot unmarshal string into Go struct field dynamicEvent.Object of type v3.NodeStatus
E0803 05:52:28.401636       1 streamwatcher.go:109] Unable to decode an event from the watch stream: backed up reader

The last message repeats about 5 times a second.

Edit: The unmarshal error is already mentioned in #12332. The unmarshal error does indeed occur periodically in the logs, but it is only sometimes followed by the “backed up reader” error and the OOM issue.
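
For context on that first log line: “cannot unmarshal string into Go struct field …” is what Go’s encoding/json reports when a watch event arrives whose JSON shape does not match the Go type the decoder expects, here a plain string where a structured v3.NodeStatus is expected. The sketch below reproduces the error class with hypothetical stand-in types; the real dynamicEvent and v3.NodeStatus types live in the Rancher/Kubernetes code base.

package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-ins for the types named in the log line.
type nodeStatus struct {
	Phase string
}

type dynamicEvent struct {
	Type   string
	Object nodeStatus
}

func main() {
	// The event carries "Object" as a bare JSON string, while the decoder
	// expects a structured status object, so unmarshalling fails.
	raw := []byte(`{"Type":"MODIFIED","Object":"True"}`)

	var ev dynamicEvent
	err := json.Unmarshal(raw, &ev)
	fmt.Println(err)
	// Prints: json: cannot unmarshal string into Go struct field dynamicEvent.Object of type main.nodeStatus
}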

Yeah, we can use this for the memory consumption work.

There have been several performance-related fixes that have gone in but were not tagged specifically to this issue. @cloudnautique can you use this issue to track your performance testing?

An update on this issue.

Even after we relocated our Rancher server's public Internet access point, the cattle pod still crashes due to a kernel OOM kill. [screenshot: 2018-07-05, 12:50 PM]
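
As a hedged aside, one way to confirm whether the kubelet recorded those restarts as OOM kills is to inspect the container's last termination state, for example with client-go as sketched below. The "cattle-system" namespace and the kubeconfig path are assumptions for illustration, not details taken from this issue.

package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (assumed default path).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed namespace; the Rancher/cattle deployment may live elsewhere.
	pods, err := clientset.CoreV1().Pods("cattle-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// Reason is "OOMKilled" when the previous container instance was
			// terminated by the OOM killer, as recorded by the kubelet.
			if t := cs.LastTerminationState.Terminated; t != nil {
				fmt.Printf("%s/%s restarts=%d lastReason=%s\n",
					pod.Name, cs.Name, cs.RestartCount, t.Reason)
			}
		}
	}
}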

We are seeing the same issue in production with our cluster nodes. Memory usage creeps up until all available memory is consumed, then the node becomes unresponsive and needs to be replaced. We now run 8 GB nodes to alleviate the problem to a degree, but as you can see in the screenshot (showing the range since Monday), there definitely is a leak somewhere.

[screenshot: 2018-07-04 16:21]

We currently run 3 EC2 m5.large worker nodes with RancherOS 1.4.0, using the EU-Frankfurt AMI provided by RancherOS, plus a separate etcd/control t2.medium node, also with RancherOS 1.4.0. Each worker node has 15-20 pods. Traffic gets to the pods via EC2 load balancers managed by the Kubernetes AWS plugin provided by Rancher. We are still running Rancher 2.0.0, but I would imagine that has little impact on this issue. Edit: Kubernetes is v1.10.1.

So far the etcd/control node has not crashed, but the worker nodes have crashed regularly.

Edit 2: Our Rancher server is also very far away from the actual cluster.