rancher: high CPU usage across all hosts during create/upgrade due to metadata service + DNS

Rancher Versions:
  Server: 1.5.5
  healthcheck: 0.2.3
  ipsec: 0.0.7
  network-services: 0.9.1 (metadata) and 0.6.6 (network manager)
  scheduler: 0.7.5
  kubernetes (if applicable): n/a

Docker Version: 17.03.0-ce

OS and where are the hosts located? (cloud, bare metal, etc): Ubuntu 16.04.2 LTS (4.4.0) and a few Ubuntu 14.04.5 LTS (3.13.0) running on AWS

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) HA Rancher with RDS

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce:

  1. Start a cluster with enough hosts that the metadata service has plenty of data to manage. (The environment where I noticed this today was running 48 hosts.)
  2. Launch a bunch of containers (e.g., some Telegraf instances pushing CPU and network metrics into InfluxDB so you can see the results in Grafana later).
  3. Launch some more containers… maybe upgrade some containers…

Results:

High CPU usage is observed on ALL hosts in the cluster. Here is one host during an upgrade:

(screenshots: CPU and network graphs for one host during the upgrade, 2017-04-13 3:29 PM)

DNS service sends a lot of data during this time as well:

(screenshot: DNS service network traffic during the same window, 2017-04-13 3:30 PM)

Digging into the logs for the metadata service, I see rapid, repeated reloading of answers:

time="2017-04-13T22:00:12Z" level=info msg="Download and reload in: 152.962397ms"
time="2017-04-13T22:00:12Z" level=info msg="Update requested for version: 672275"
time="2017-04-13T22:00:12Z" level=info msg="Downloaded in 17.306846ms"
time="2017-04-13T22:00:12Z" level=info msg="Generating and reloading answers"
time="2017-04-13T22:00:12Z" level=info msg="Update requested for version: 672274"
time="2017-04-13T22:00:12Z" level=info msg="Generating answers"
time="2017-04-13T22:00:12Z" level=info msg="Generated and reloaded answers"
time="2017-04-13T22:00:12Z" level=info msg="Applied http://ha.rancher.mux/v1/configcontent/metadata-answers?client=v2&requestedVersion=672274?version=672276-14a7928f14789d3a00ab33efdfd9c22c"
time="2017-04-13T22:00:12Z" level=info msg="Download and reload in: 312.971699ms"
time="2017-04-13T22:00:12Z" level=info msg="Downloaded in 10.124486ms"
time="2017-04-13T22:00:12Z" level=info msg="Generating and reloading answers"
time="2017-04-13T22:00:12Z" level=info msg="Update requested for version: 672278"
time="2017-04-13T22:00:12Z" level=info msg="Generating answers"
time="2017-04-13T22:00:13Z" level=info msg="Generated and reloaded answers"
time="2017-04-13T22:00:13Z" level=info msg="Applied http://ha.rancher.mux/v1/configcontent/metadata-answers?client=v2&requestedVersion=672274?version=672275-14a7928f14789d3a00ab33efdfd9c22c"
time="2017-04-13T22:00:13Z" level=info msg="Download and reload in: 330.665421ms"
time="2017-04-13T22:00:13Z" level=info msg="Update requested for version: 672278"
time="2017-04-13T22:00:13Z" level=info msg="Downloaded in 42.239164ms"
time="2017-04-13T22:00:13Z" level=info msg="Generating and reloading answers"
time="2017-04-13T22:00:13Z" level=info msg="Update requested for version: 672277"
time="2017-04-13T22:00:13Z" level=info msg="Generating answers"
time="2017-04-13T22:00:13Z" level=info msg="Generated and reloaded answers"
time="2017-04-13T22:00:13Z" level=info msg="Applied http://ha.rancher.mux/v1/configcontent/metadata-answers?client=v2&requestedVersion=672278?version=672278-14a7928f14789d3a00ab33efdfd9c22c"
time="2017-04-13T22:00:13Z" level=info msg="Download and reload in: 298.367767ms"
time="2017-04-13T22:00:13Z" level=info msg="Downloaded in 19.420332ms"
time="2017-04-13T22:00:13Z" level=info msg="Generating and reloading answers"
time="2017-04-13T22:00:13Z" level=info msg="Generating answers"
time="2017-04-13T22:00:13Z" level=info msg="Generated and reloaded answers"
time="2017-04-13T22:00:13Z" level=info msg="Applied http://ha.rancher.mux/v1/configcontent/metadata-answers?client=v2&requestedVersion=672277?version=672278-14a7928f14789d3a00ab33efdfd9c22c"
time="2017-04-13T22:00:13Z" level=info msg="Download and reload in: 327.065607ms"
time="2017-04-13T22:00:14Z" level=info msg="Update requested for version: 672279"
time="2017-04-13T22:00:14Z" level=info msg="Downloaded in 50.613182ms"
time="2017-04-13T22:00:14Z" level=info msg="Generating and reloading answers"
time="2017-04-13T22:00:14Z" level=info msg="Generating answers"
time="2017-04-13T22:00:14Z" level=info msg="Generated and reloaded answers"
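The churn above can be quantified by counting full answer regenerations per second. A quick sketch, assuming the metadata container's logs have been saved to a file (here called `metadata.log`; the container name and log location vary per host):

```shell
# Count "Generated and reloaded answers" events per second.
# Assumes the metadata container's logs were saved first, e.g. via
# `docker logs <metadata-container> > metadata.log` (name varies per host).
grep 'Generated and reloaded answers' metadata.log \
  | sed -E 's/.*time="([^"]+)".*/\1/' \
  | sort | uniq -c
```

Against the excerpt above, this shows five full regenerations in a roughly three-second window, which lines up with the CPU spikes.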

At the same time, the DNS service is reloading aggressively as well:

time="2017-04-13T22:00:12Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:12Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:12Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:12Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:12Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:12Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:13Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:13Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:13Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:13Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:14Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:14Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:14Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:14Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:15Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:15Z" level=info msg="Reloaded answers"
time="2017-04-13T22:00:15Z" level=info msg="Reloading answers"
time="2017-04-13T22:00:15Z" level=info msg="Reloaded answers"
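The DNS reload rate can be summarized the same way. A small awk sketch, again assuming the logs were saved to a file (`dns.log` is a placeholder name):

```shell
# Count completed reloads and the number of distinct seconds they span.
# Assumes the DNS container's logs were saved first, e.g. via
# `docker logs <dns-container> > dns.log` (container name varies per host).
# With -F'"' the timestamp is field 2 of each log line.
awk -F'"' '/Reloaded answers/ { if (!(seen[$2]++)) d++; n++ }
  END { printf "%d reloads across %d distinct seconds\n", n, d }' dns.log
```

Run against the excerpt above, it reports 9 reloads across 4 distinct seconds, i.e. the DNS service is doing multiple full reloads per second on a single host.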

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 18

Most upvoted comments

I’ve got the same issue. CPU usage is really high for this little microservice.

Hello, I am experiencing the same issue after upgrading to 1.6.12.

Also seeing the same thing with Rancher 1.6.14 and Docker 17.12.

@aemneina I think this issue ought to be re-opened or tracked in a new issue.

We are having similar issues on Rancher 1.6.2 (3k containers, 9 hosts).