traefik: High CPU usage

Do you want to request a feature or report a bug?

bug

What did you do?

When using Traefik in front of dynamic web applications (e.g. Nextcloud, a speedtest service), I see very high CPU usage, up to 100%, whenever data is transferred (e.g. large downloads or speedtests).

Note: Traefik and the web apps both run in Docker Swarm.

Update 1: I disabled compression on the entrypoints; CPU usage is still at 50-60%. Is this expected? (See the sketch just below for what that change looks like.)
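
For reference, a minimal sketch of the change, using the Traefik 1.4 entrypoint syntax from the configuration further down (Compress defaults to false, so dropping the option or setting it explicitly to false both work):

      - "--entrypoints=Name:http Address::80 Compress:false Redirect.EntryPoint:https"
      - "--entryPoints=Name:https Address::443 TLS Compress:false"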

What did you expect to see?

No CPU spikes (i.e. behavior comparable to typical load balancers / reverse proxies).

What did you see instead?

CPU load rises up to 100%.

Output of traefik version: (What version of Traefik are you using?)

Version:      v1.4.5
Codename:     roquefort
Go version:   go1.9.2
Built:        2017-12-06_10:16:48AM
OS/Arch:      linux/amd64

What is your environment & configuration (arguments, toml, provider, platform, …)?

services:
  traefik:
    image: ${IMAGE}:${RELEASE}
    environment:
      ACME_EMAIL: ${ACME_EMAIL}
      ACME_DEFAULT_HOST: ${ACME_DEFAULT_HOST}
    command:
      - --configfile=/run/secrets/traefik_admin
      - --loglevel=WARN
      - --checknewversion=false
      - --insecureskipverify=true
      - --defaultentrypoints=http,https
      - "--entrypoints=Name:http Address::80 Compress:true Redirect.EntryPoint:https"
      - "--entryPoints=Name:https Address::443 TLS Compress:true"
      - --acme
      - --acme.email=${ACME_EMAIL}
      - --acme.domains=${ACME_DEFAULT_HOST}
      - --acme.onhostrule
      - --acme.entrypoint=https
      - --acme.storage=/data/acme.json
      - --docker
      - --docker.watch
      - --docker.exposedbydefault=false
      - --docker.swarmmode
      # - --debug
    networks:
     - traefik
    ports:
      - target: 80
        published: 80
        protocol: tcp
        mode: host
      - target: 443
        published: 443
        protocol: tcp
        mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - traefik:/data
    healthcheck:
      test: ["CMD-SHELL", "traefik healthcheck --web"]
      interval: 30s
      timeout: 5s
      retries: 2
      start_period: 30s
    deploy:
      restart_policy:
        condition: any
        delay: 2s
      update_config:
        monitor: 120s
        failure_action: continue
      labels:
        - "traefik.enable=true"
        - "traefik.docker.network=org_traefik"
        - "traefik.port=8080"
        - "traefik.protocol=http"
        - "traefik.frontend.rule=Host:${ACME_DEFAULT_HOST};PathPrefixStrip:/traefik"
    secrets:
      - traefik_admin

If applicable, please paste the log output in debug mode (--debug switch)

(paste your output here)

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 9
  • Comments: 37 (16 by maintainers)

Most upvoted comments

Things have gotten worse in 1.6. I wish you would give performance some more love soon, because it won't be long before your space is taken over by Envoy or cloud-native Nginx advancements.

Ok I see, so right now we increased the limit to 100Mi, but we will have to monitor if this is enough for our application. I am happy though that at least now we have a logical explanation for this weird behavior we’ve been seeing.

Looking at @GaruGaru’s profile, it seems like most CPU cycles (43.25% of sample time) are spent inside Go’s math/big package for the purpose of TLS communication/handshaking:

(pprof) top
Showing nodes accounting for 16.46s, 72.86% of 22.59s total
Dropped 576 nodes (cum <= 0.11s)
Showing top 10 nodes out of 153
      flat  flat%   sum%        cum   cum%
     9.77s 43.25% 43.25%      9.77s 43.25%  math/big.addMulVVW
     1.94s  8.59% 51.84%     12.43s 55.02%  math/big.nat.montgomery
     1.67s  7.39% 59.23%      1.74s  7.70%  syscall.Syscall
     1.03s  4.56% 63.79%      1.03s  4.56%  runtime.memmove
     0.58s  2.57% 66.36%      0.58s  2.57%  math/big.mulAddVWW
     0.41s  1.81% 68.17%      1.90s  8.41%  math/big.nat.divLarge
     0.34s  1.51% 69.68%      0.34s  1.51%  runtime.usleep
     0.27s  1.20% 70.87%      0.27s  1.20%  crypto/sha256.block
     0.26s  1.15% 72.02%      0.26s  1.15%  math/big.subVV
     0.19s  0.84% 72.86%      0.19s  0.84%  math/big.shlVU

As a graph:

[image: garugaru-profile CPU call graph]

It’s hard for me to tell whether the 44% CPU consumption that @GaruGaru reported is reasonable or not. (By the way, is that 44% per core or across the entire machine?) It also depends on the request rate that those 50 concurrent users produced.

Is there a chance for you to disable TLS to see if it makes a difference?
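
For anyone who wants to collect a similar profile, a rough sketch using Go's standard pprof tooling. This assumes the pprof HTTP endpoints are exposed on the dashboard/API port (8080 here); in Traefik 2.x that is the --api.debug option serving /debug/pprof, while for 1.x you should check whether --debug exposes the same paths:

    # sample CPU usage for 30 seconds and open the interactive pprof shell
    go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"
    (pprof) top 10     # hottest functions, as in the output above
    (pprof) web        # render the call graph (requires graphviz)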

You’re right: the high CPU usage is due to slow handshakes with the default RSA 4096 certificate in Traefik; see the Go issue https://github.com/golang/go/issues/20058. I changed the key type to ECDSA ("--acme.keytype=EC256") and CPU usage went back to normal.
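
Applied to a setup like the one in the original report, that fix amounts to one extra flag in the command list (sketch only; the acme.keytype option is not available in every 1.x release, so check the docs for your version):

      command:
        # ... existing flags ...
        - --acme.keytype=EC256   # request an ECDSA P-256 certificate instead of the RSA default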

This memory limit in the helm chart is a pretty painful default; your team is not the first one that has been hurt by it. I tried to lobby for a change in the chart values a while ago, but no luck to date.

Now, given that the official helm chart is maintained directly by the Traefik team, feel free to propose a change to the defaults if you agree with me that this is reasonable. IMO the combined time users spend upgrading from chart v1 to v2 is far less than the effort needed to tackle this IO nightmare people randomly run into. It took me a while to figure out why Traefik started killing my CPU, and I felt desperate during the investigation 😞

If someone posts some instructions I’d be happy to do some profiling.

Hi @tboerger, the Traefik 2.0 GA version doesn’t have this behaviour as far as we know (reference: #5294 was closed during the RC phase).

As this issue is old and has been hijacked by a lot of different users, we are closing it for triage cleanup.

Please feel free to open new issues with a reproduction case if you are still seeing this behavior with the latest Traefik v2.0. Thanks!

The issue is not specific to compression; it’s about resource usage in general. With compression, HTTPS, high load, or any other resource-consuming factor, Traefik goes over the default 20MiB RAM limit and starts endlessly writing to and reading from swap.

IMO the problem is that the chart defaults are not picked correctly, and this causes CPU madness under pretty standard conditions. I was in favor of changing the defaults and thus upgrading the chart to v2, but that proposal has not been accepted.

The result is that new Traefik chart users fall into the same trap I did a year ago.

@Falx Do you have a memory limit set on your pod? The issue could be related to this. I once noticed high CPU usage after certain requests, and it simply turned out there was a lot of pod swap IO due to the memory limits set by default in the helm chart.

problem: #1908 (comment)

solution: #1908 (comment)

Thank you very much for this information. We’ve updated our memoryLimit in the helm chart to 100Mi and will see how it goes in the coming days!
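
For anyone else hitting this, a minimal sketch of that override (the memoryRequest/memoryLimit value names are assumed from the comments above and the old stable/traefik chart; verify them against your chart's values.yaml):

    # values.yaml
    memoryRequest: 50Mi    # assumed starting point; tune for your workload
    memoryLimit: 100Mi     # well above the ~20Mi default mentioned above

    # then apply it, e.g.:
    # helm upgrade --install traefik stable/traefik -f values.yaml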

We have recently upgraded to Traefik version 1.6.4 and the CPU usage increased dramatically. [screenshot: CPU usage graph, 2018-06-24 21:53]

Similar here. Traefik consumes more CPU than the backend PHP app and almost the same as MySQL.

I have observed this behavior since I started using Traefik at 1.3.x; it seems to be even a bit worse on 1.5.4.

All static files are hosted by a CDN and the site is not using a full page cache.

When I attach strace, most of the output is:

futex() = 1
futex() = -1 EAGAIN (Resource temporarily unavailable)
read() = -1 EAGAIN (Resource temporarily unavailable)

Total times:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.23    1.343642         116     11631       830 futex
  1.58    0.021588           4      5459           epoll_wait
  0.06    0.000875           1      1704        38 write
  0.05    0.000672           0      1632           pselect6
  0.05    0.000662           0      3201      2328 read
  0.02    0.000306           3        95           getrandom
  0.00    0.000050           0       111           close
  0.00    0.000000           0        12           sched_yield
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1         1 connect
  0.00    0.000000           0        44           getsockname
  0.00    0.000000           0        45           setsockopt
  0.00    0.000000           0         1         1 restart_syscall
  0.00    0.000000           0       156           epoll_ctl
  0.00    0.000000           0        87        43 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    1.367795                 24180      3241 total
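
For completeness, a summary table like the one above can be collected roughly as follows (assuming strace is available on the host and Traefik's PID is known; -c makes strace print the per-syscall time/calls/errors counters on exit):

    # follow all threads (-f), count syscalls only (-c), attach to the running process (-p)
    strace -f -c -p "$(pidof traefik)"
    # let it run while the load is applied, then stop it with Ctrl+C to print the summary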

Did the profiling on a test machine (Scaleway VC1S cloud instance, 2-core Intel Atom C2750 @ 2.40GHz) with Traefik 1.5.3 running on Docker, using siege with 50 concurrent users. CPU usage was 44% on the node running the Traefik container and only 5% on the node serving the actual content.

profile.zip
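
For context, the kind of load described above can be generated roughly like this with siege (host name is a placeholder; 50 concurrent users for one minute):

    siege -c 50 -t 1M https://<DOMAIN>/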

My docker-compose

 traefik:
    image: traefik:1.5.3-alpine
    command:
        - "--logLevel=DEBUG"
        - "--api"
        - "--docker"
        - "--docker.swarmmode" 
        - "--docker.watch"
        - "--web"
        - "--web.metrics.statsd.address=<STATSD_HOST>"
        - "--entrypoints=Name:http Address::80 Redirect.EntryPoint:https"
        - "--entrypoints=Name:https Address::443 TLS"
        - "--defaultentrypoints=http,https"
        - "--acme"
        - "--acme.storage=acme.json"
        - "--acme.entryPoint=https"
        - "--acme.httpChallenge.entryPoint=http"
        - "--acme.OnHostRule=true" 
        - "--acme.email=<MAIL>"
        - "--docker"
        - "--docker.swarmmode"
        - "--docker.domain=<DOMAIN>"
        - "--docker.watch"
    networks:
      - proxy
    ports:
        - 80:80
        - 443:443
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

I retested with 1.5.0 and got similar CPU metrics.