rancher: Slow upgrade / any action and high CPU utilization

Rancher Versions:
Server: 1.3.4
Cattle: 0.175.10
healthcheck: 0.2.0
ipsec: 0.0.2
network-services: 0.0.8
scheduler: 0.3.0
kubernetes (if applicable):

Docker Version: 1.13.0

OS and where are the hosts located? (cloud, bare metal, etc): Ubuntu 16, bare metal; ~50 hosts, ~250 containers in 5 services.

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) External DB on r3.large; 2 Rancher hosts on c4.xlarge (4 cores each).

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle v0.175.10

Steps to Reproduce: Run a rolling upgrade/finish/rollback on any service, or perform almost any action on containers.

Results: It takes hours to rotate ~50 containers, and both Rancher “masters” sit at 100% CPU utilization. Running a basic Java profiler (./jvmtop.sh 9 --profile -n 50) gives:

 JvmTop 0.8.0 alpha - 06:15:02,  amd64,  4 cpus, Linux 4.4.0-59-, load avg 12.90
 http://code.google.com/p/jvmtop
 Profiling PID 9:         io.cattle.platform.launcher.Main 
  23.02% (     2.13s) org.yaml.snakeyaml.resolver.Resolver.resolve()
  15.21% (     1.63s) org.yaml.snakeyaml.emitter.Emitter.writePlain()
  14.23% (     1.52s) ....yaml.snakeyaml.representer.SafeRepresenter$Represent()
   5.87% (     0.63s) ....yaml.snakeyaml.representer.Representer.representJava()
   5.78% (     0.62s) org.mariadb.jdbc.internal.util.buffer.ReadUtil.readFully()
   4.10% (     0.44s) org.yaml.snakeyaml.emitter.Emitter.writeIndent()
   3.97% (     0.43s) org.yaml.snakeyaml.emitter.Emitter.writeIndicator()
   3.41% (     0.37s) org.yaml.snakeyaml.emitter.Emitter.analyzeScalar()
...

It spends more than 70% of its CPU time encoding and decoding YAML!
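For context, a minimal micro-benchmark (hypothetical code, not Cattle's actual metadata path) shows how quickly repeated SnakeYAML dump/load cycles add up on even a modest document; the hot methods match the Resolver.resolve() and Emitter.* entries in the profile above.

    // Hypothetical micro-benchmark: repeatedly dump and load a nested map with
    // SnakeYAML, the kind of work that dominates the jvmtop profile above.
    import org.yaml.snakeyaml.Yaml;
    import java.util.*;

    public class YamlChurn {
        public static void main(String[] args) {
            Yaml yaml = new Yaml();
            Map<String, Object> doc = new HashMap<>();
            for (int i = 0; i < 500; i++) {          // metadata for ~500 containers
                doc.put("container-" + i, Map.of(
                        "ip", "10.42.0." + (i % 254),
                        "state", "running",
                        "labels", List.of("svc", "stack")));
            }
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++) {         // one cycle per reconcile step
                String text = yaml.dump(doc);        // Emitter.writePlain() etc.
                yaml.load(text);                     // Resolver.resolve()
            }
            System.out.printf("1000 dump/load cycles took %.1f s%n",
                    (System.nanoTime() - start) / 1e9);
        }
    }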

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 11
  • Comments: 29 (3 by maintainers)

Most upvoted comments

Currently we have ~100 hosts across 3 environments. When no upgrade or other change is running, the Rancher CPUs are quiet.

In Environment1 we have 50 hosts and 5 services, each with 50 containers. When I run an upgrade on one service (50 containers) one-by-one, it takes 30 minutes and drives 32 CPU cores on the Rancher side to almost 100%.

If Rancher is on weaker instances, the upgrade gets stuck trying to start one container again and again. The message “Need to restart service reconcile” appears on the service and “Error: Timeout getting IP address” on the container. It stays in this state until metadata/network-services is restarted on the host where the container gets the “Timeout getting IP address” error.

When there is some disaster, or not even a disaster, just some stumble, it can kill metadata on almost all hosts, and then Rancher is useless, unable to do ANYTHING, even with 32 CPU cores. And I am not talking about slowness: on Rancher 1.1 the same upgrade took 4-5 minutes on 8 CPU cores!

I can understand that the metadata service needs to know about changes in the environment. But encoding and decoding YAML simply should not take this much CPU and time!
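For what it's worth, the encode cost does not have to be paid on every poll. A hypothetical sketch along these lines (illustrative names, not Rancher's actual metadata code) caches the dumped YAML and only re-encodes when the environment version changes.

    // Hypothetical, single-threaded sketch: re-serialize only when the data has
    // actually changed, instead of re-encoding the same document on every request.
    import org.yaml.snakeyaml.Yaml;
    import java.util.Map;

    class CachedYamlView {
        private final Yaml yaml = new Yaml();
        private long dataVersion = 0;     // bumped whenever the environment changes
        private long cachedVersion = -1;  // version the cached text was built from
        private String cachedYaml = "";

        void markChanged() {
            dataVersion++;
        }

        String render(Map<String, Object> metadata) {
            if (dataVersion != cachedVersion) {
                cachedYaml = yaml.dump(metadata); // pay the SnakeYAML cost once per change
                cachedVersion = dataVersion;
            }
            return cachedYaml;                    // unchanged polls reuse the cached text
        }
    }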

@underyx The purpose of trying this is really for us to experience these pains first hand. But don’t worry, I am already aware of these issues and most will be addressed in 1.5, with more improvements coming in 1.6. 1.5 will be released in a week.

The current use case I’m testing is 1 environment with 50 hosts and 10,000 services of scale 1.

@underyx done. Looks like some improvements could happen here.