rancher: Rancher 2.0.4 server memory leak
Rancher versions: rancher/rancher: v2.0.4
Kubernetes versions:
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1", GitCommit:"d4ab47518836c750f9949b9e0d387f20fb92260b", GitTreeState:"clean", BuildDate:"2018-04-12T14:14:26Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Docker version: (docker version, docker info preferred)
Client:
Version: 1.12.6
API version: 1.24
Go version: go1.6.4
Git commit: 78d1802
Built: Tue Jan 10 20:38:45 2017
OS/Arch: linux/amd64
Server:
Version: 1.12.6
API version: 1.24
Go version: go1.6.4
Git commit: 78d1802
Built: Tue Jan 10 20:38:45 2017
OS/Arch: linux/amd64
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=xenial
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) OpenStack VM
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) We set up a 3-node Kubernetes cluster manually via RKE on OpenStack-backed VMs and deployed Rancher 2.0.4 as a single-pod deployment (we have been running Rancher 2.0 since the technical preview).
Recently this system became the web console for our production Kubernetes clusters, and we discovered that the pod of the cattle deployment crashes due to OOM (RSS nearly 7 GiB) every few days.
Every time the cattle pod crashes, all of our clusters need to re-establish their connections to it, which is quite annoying.
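In case it helps others hitting the same crash loop, below is a minimal sketch of how the server pod's memory growth can be watched and capped so the restart at least happens at a predictable threshold. The cattle-system namespace, the cattle deployment name, and the app=cattle label are assumptions, and kubectl top requires a metrics source (heapster/metrics-server); adjust everything to your own setup.

# Watch the Rancher server pod's memory over time
# (assumed namespace/label; requires heapster or metrics-server).
kubectl -n cattle-system top pod -l app=cattle

# Cap the container's memory so the kubelet restarts it at a known threshold
# instead of letting the kernel OOM killer hit other workloads on the node.
kubectl -n cattle-system set resources deployment cattle --limits=memory=8Gi

# After a crash, confirm the restart count and the OOMKilled reason.
kubectl -n cattle-system describe pod -l app=cattle | grep -A3 "Last State"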
Is there any progress on this issue, or can we help with further investigation?
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 29 (6 by maintainers)
We’ve addressed 3 or 4 different performance issues around this ticket in v2.0.8. @cloudnautique has verified those through performance testing. As he suggests, I am going to close this issue. New bugs or issues around memory leaks or performance should be opened as new issues.
@cloudnautique Of course, I will post tomorrow. In short, it is the problem described by @BowlingX: I noticed memory increasing to 100% on the catalog apps tab and probably other tabs as well.
Rancher 2.0.6 with Kubernetes v1.10.5-rancher1-1
Maybe a hint about which direction to look in (from the Rancher server logs correlated with the CPU/memory spikes):
The last message repeats about five times a second.
Edit: The unmarshal error is already mentioned in #12332. The unmarshal error does occur periodically in the logs, but it is only sometimes followed by the “backup up reader” error and the OOM issue.
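For anyone trying to correlate those repeating log lines with the memory spikes, here is a rough sketch of what can be run by hand: it timestamps periodic kubectl top samples next to the streamed server logs so the two can be lined up afterwards. The cattle-system namespace, the app=cattle label, and the pod name placeholder are assumptions; adjust them to match your deployment.

# Sample the server pod's CPU/memory every 30s with a timestamp
# (assumed namespace/label; requires heapster or metrics-server).
while true; do
  echo "$(date -u +%FT%TZ) $(kubectl -n cattle-system top pod -l app=cattle --no-headers)"
  sleep 30
done >> cattle-usage.log &

# Stream the server logs with timestamps into a second file so the repeating
# unmarshal / "backup up reader" messages can be matched against the spikes.
# (<cattle-pod-name> is a placeholder for the actual pod name.)
kubectl -n cattle-system logs -f <cattle-pod-name> --timestamps > cattle-server.log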
Yeah, we can use this for the memory consumption work.
There have been several performance-related fixes that have gone in but were not tagged specifically to this issue. @cloudnautique can you use this issue to track your performance testing?
An update on this issue:
Even after we relocated our Rancher server's public Internet access point, the cattle pod still crashes due to kernel OOM kills.
We are seeing the same issue in production with our cluster nodes. Memory usage creeps up until all available memory is consumed, then the node becomes unresponsive and needs to be replaced. We now run 8GB nodes to alleviate the problem to a degree but as you can see in the screenshot (showing a range since Monday), there definitely is a leak somewhere.
We currently run 3 EC2 m5.large worker nodes with RancherOS 1.4.0 using the EU-Frankfurt RancherOS AMI, and a separate etcd/control t2.medium node with RancherOS 1.4.0. Each worker node has 15-20 pods. Traffic reaches the pods via EC2 load balancers managed by the K8s AWS plugin provided by Rancher. We are still running Rancher 2.0.0, but I would imagine that has little impact on this issue. Edit: Kubernetes is v1.10.1.
So far the etcd/control node has not crashed, but the worker nodes crashed regularly.
Edit 2: Our Rancher server is also very far away from the actual cluster.
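When a worker starts creeping toward OOM, a quick way to see whether it is the node as a whole or a single pod is to check the node's MemoryPressure condition and the per-pod usage scheduled on it. This is only a sketch of manual checks; <node-name> is a placeholder, and kubectl top again assumes a metrics source is installed.

# Overall memory per node (requires heapster/metrics-server).
kubectl top node

# Check whether the kubelet has reported MemoryPressure on a suspect node.
kubectl describe node <node-name> | grep -A8 "Conditions:"

# List what is scheduled on that node, then compare against per-pod usage
# to spot the consumer.
kubectl describe node <node-name> | grep -A30 "Non-terminated Pods"
kubectl top pod --all-namespaces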