traefik: Traefik killed by OOM killer in Kubernetes

Do you want to request a feature or report a bug?

Bug

What did you do?

  • Running Traefik as a Deployment (ingress controller) in K8s
  • Having a bunch of Services targeting no Pods (so the related Endpoints objects have no addresses)
  • Our setup is specific in that we downscale tens of Deployments to 0 replicas on a schedule so as not to waste cluster resources (a minimal reproduction sketch follows this list).
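For reference, the downscaling amounts to something like the client-go sketch below (illustrative only, written against a recent client-go; "some-app" and "default" are placeholder names, and in reality a scheduled job does this for tens of Deployments):

    // Illustrative only: scale a Deployment to zero replicas, which leaves the
    // Service that selects its Pods with an Endpoints object without addresses.
    package main

    import (
        "context"
        "log"

        autoscalingv1 "k8s.io/api/autoscaling/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // Placeholder Deployment/namespace; a scheduled job does this for tens
        // of Deployments in our clusters.
        scale := &autoscalingv1.Scale{
            ObjectMeta: metav1.ObjectMeta{Name: "some-app", Namespace: "default"},
            Spec:       autoscalingv1.ScaleSpec{Replicas: 0},
        }
        _, err = client.AppsV1().Deployments("default").
            UpdateScale(context.TODO(), "some-app", scale, metav1.UpdateOptions{})
        if err != nil {
            log.Fatal(err)
        }
        // The Ingress rules still point at the Service, so Traefik keeps routing
        // to a backend with no available endpoints.
    }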

What did you expect to see?

  • Traefik to keep working and to reply with an error when such a Service is accessed

What did you see instead?

  • Traefik leaking memory and eventually being killed by the OOM killer

Output of traefik version: (What version of Traefik are you using?)

1.7.4

What is your environment & configuration (arguments, toml, provider, platform, …)?

    defaultEntryPoints = ["http","https"]
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      compress = true
        [entryPoints.http.proxyProtocol]
        trustedIPs = ["xxxx"]
      [entryPoints.https]
      address = ":443"
      compress = true
        [entryPoints.https.proxyProtocol]
        trustedIPs = ["xxxx"]
        [entryPoints.https.tls]
        minVersion = "VersionTLS11"
        cipherSuites = [
          "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
          "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
          "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA",
          "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA",
          "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA",
          "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA"
        ]
    [respondingTimeouts]
      idleTimeout = "900s"
    [lifeCycle]
      requestAcceptGraceTimeout = "10s"
      graceTimeOut = "10s"
    [api]
    [rest]
    [kubernetes]
    [kubernetes.ingressEndpoint]
      hostname = "xxxx.xxxx"
    [metrics]
      [metrics.prometheus]
    [accessLog]

Logs

time="2018-11-28T11:44:33Z" level=warning msg="Endpoints not available for xxx"
time="2018-11-28T11:44:33Z" level=warning msg="Endpoints not available for xxx"
time="2018-11-28T11:44:33Z" level=warning msg="Endpoints not available for xxx"

Tons of those ^

I suspect a memory leak in the watcher that watches the Services/Endpoints, as this memory growth shows up only in this case; on our other clusters, where we don't downscale Deployments like this, Traefik works just fine.
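If it would help with debugging, we can also grab heap profiles before and after a downscale cycle. My assumption (please correct me) is that enabling debug = true under [api] exposes Go's /debug/pprof handlers on the API port (8080 in our Deployment); the sketch below just snapshots the heap so two dumps can be diffed with go tool pprof:

    // Sketch, assuming [api] debug mode exposes /debug/pprof on port 8080
    // (port-forwarded to localhost here): snapshot the heap profile to a file.
    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
        "os"
        "time"
    )

    func main() {
        resp, err := http.Get("http://localhost:8080/debug/pprof/heap")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        name := fmt.Sprintf("traefik-heap-%s.pprof", time.Now().Format("15-04-05"))
        f, err := os.Create(name)
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if _, err := io.Copy(f, resp.Body); err != nil {
            log.Fatal(err)
        }
        log.Printf("wrote %s; compare two snapshots with: go tool pprof -base <old> <new>", name)
    }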

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

Thanks, sure will try to.

I think it is quite apparent from our Grafana boards. The bottom graph is go_memstats_* and the top one is process_*_memory_bytes.

Cluster with the leakage: [screenshot 2018-11-28 at 16 56 25]

Normal operation: [screenshot 2018-11-28 at 16 56 41]

I think it is clear the healthy cluster consumes way less memory; each peak in the first screenshot means the Pod was killed by the OOM killer as it hit the cgroups limit.
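For completeness, the same two series can be read straight off the /metrics endpoint that [metrics.prometheus] already exposes. A rough sketch (the metric names are the Go Prometheus client defaults; the port/path assume metrics are served on the API entrypoint, 8080):

    // Quick check of Go heap vs. resident memory from Traefik's Prometheus
    // endpoint; assumes /metrics is served on the API entrypoint (port 8080).
    package main

    import (
        "bufio"
        "log"
        "net/http"
        "strings"
    )

    func main() {
        resp, err := http.Get("http://localhost:8080/metrics")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Trailing spaces match the plain gauge lines and skip *_total counters.
        wanted := []string{"go_memstats_alloc_bytes ", "process_resident_memory_bytes "}
        sc := bufio.NewScanner(resp.Body)
        for sc.Scan() {
            line := sc.Text()
            for _, prefix := range wanted {
                if strings.HasPrefix(line, prefix) {
                    log.Println(line)
                }
            }
        }
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }
    }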

Deployment:

Name:                   traefik-ingress-controller
Namespace:              traefik
CreationTimestamp:      Mon, 23 Jul 2018 16:31:25 +0200
Labels:                 app=traefik-ingress
                        chart=traefik-ingress-0.1.0
                        heritage=Tiller
                        release=wandera
Annotations:            deployment.kubernetes.io/revision: 11
                        kubectl.kubernetes.io/last-applied-configuration:
                          {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"traefik-ingress","chart":"traefik-ingress-0.1.0"...
Selector:               app=traefik-ingress,release=wandera
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=traefik-ingress
                    release=wandera
  Annotations:      checksum/config: 12c4dab3982cdb720b274f2605aea93ee5e5fbd1fb4ce38f8c03d43dd371e661
  Service Account:  traefik-ingress-controller
  Containers:
   traefik-ingress-lb:
    Image:       traefik:v1.7.4-alpine
    Ports:       80/TCP, 443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Args:
      --configfile=/config/traefik.toml
      --logLevel=info
    Limits:
      cpu:     500m
      memory:  312Mi
    Requests:
      cpu:        200m
      memory:     256Mi
    Liveness:     tcp-socket :80 delay=10s timeout=2s period=10s #success=1 #failure=3
    Readiness:    tcp-socket :80 delay=10s timeout=2s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /config from config-volume (rw)
  Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      traefik-config
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   traefik-ingress-controller-5bbfbbdd6b (2/2 replicas created)
Events:          <none>

Pod:

Name:           traefik-ingress-controller-5bbfbbdd6b-gvtdl
Namespace:      traefik
Node:           xxx
Start Time:     Tue, 13 Nov 2018 14:40:39 +0100
Labels:         app=traefik-ingress
                pod-template-hash=1669668826
                release=wandera
Annotations:    checksum/config: 12c4dab3982cdb720b274f2605aea93ee5e5fbd1fb4ce38f8c03d43dd371e661
Status:         Running
IP:             xxx
Controlled By:  ReplicaSet/traefik-ingress-controller-5bbfbbdd6b
Containers:
  traefik-ingress-lb:
    Container ID:  xxx
    Image:         traefik:v1.7.4-alpine
    Image ID:      xxx
    Ports:         80/TCP, 443/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --configfile=/config/traefik.toml
      --logLevel=info
    State:          Running
      Started:      Wed, 28 Nov 2018 17:04:05 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 28 Nov 2018 16:57:59 +0100
      Finished:     Wed, 28 Nov 2018 17:03:45 +0100
    Ready:          True
    Restart Count:  2375
    Limits:
      cpu:     500m
      memory:  312Mi
    Requests:
      cpu:        200m
      memory:     256Mi
    Liveness:     tcp-socket :80 delay=10s timeout=2s period=10s #success=1 #failure=3
    Readiness:    tcp-socket :80 delay=10s timeout=2s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from traefik-ingress-controller-token-k7njj (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      traefik-config
    Optional:  false
  traefik-ingress-controller-token-k7njj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  traefik-ingress-controller-token-k7njj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  role=shared
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 role=shared:NoSchedule
Events:
  Type     Reason   Age                     From                                                 Message
  ----     ------   ----                    ----                                                 -------
  Warning  BackOff  2m31s (x8619 over 15d)  kubelet, xxx  Back-off restarting failed container

The scheduler logs are not of much help as the cluster is quite busy, and the Pods are not rescheduled, just restarted, as is apparent from the "Warning BackOff 2m31s (x8619 over 15d) kubelet, xxx Back-off restarting failed container" event above. What metrics do you want me to provide? A dump of the Prometheus endpoint?

OFC

Ingresses: 340
Services: 451 (all ClusterIP)
Endpoints: 451

What might be of interest is the number of Services whose Endpoints have no addresses (it changes over time, with more of them outside business hours).

MissingEndpoints: 200 on average, going up to 451 (i.e. the total number of Services).
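That count roughly corresponds to counting Endpoints objects with no addresses, along the lines of the sketch below (illustrative; not the exact script we run):

    // Illustrative sketch (not the exact script behind the numbers above):
    // count Endpoints objects across all namespaces that have no addresses.
    package main

    import (
        "context"
        "fmt"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        eps, err := client.CoreV1().Endpoints(metav1.NamespaceAll).
            List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            log.Fatal(err)
        }

        missing := 0
        for _, ep := range eps.Items {
            addresses := 0
            for _, subset := range ep.Subsets {
                addresses += len(subset.Addresses)
            }
            if addresses == 0 {
                missing++
            }
        }
        fmt.Printf("Endpoints with no addresses: %d of %d\n", missing, len(eps.Items))
    }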