linkerd2: Linkerd proxy killed with OOM

Bug Report

What is the issue?

The Linkerd proxy is being OOM-killed on a regular basis. However, its memory usage before the kill is stable at ~6 MB; we see neither memory creep nor spikes. The problem might be that the proxy tries to allocate a large amount of memory at once, or within a very short period of time, which would explain why the high usage is not reflected in our monitoring system.
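
A minimal sketch of how the affected pods can be enumerated across namespaces (assumes jq is available; linkerd-proxy is the injected container name, as shown in the describe output below):

    # List pods whose linkerd-proxy container was last terminated with OOMKilled.
    # This only reads container status already reported by the kubelet.
    kubectl get pods --all-namespaces -o json \
      | jq -r '.items[]
          | select(.status.containerStatuses[]?
              | .name == "linkerd-proxy"
                and .lastState.terminated.reason == "OOMKilled")
          | "\(.metadata.namespace)/\(.metadata.name)"'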

How can it be reproduced?

We are unable to reproduce this issue in a controlled manner, but it happens sporadically in all of our environments.

Logs, error output, etc

Here is the output of kubectl describe showing the status of the linkerd-proxy container:

    Container ID:  docker://1ad8fc58b8e58274586048c433063bdfbe3f0ee05c59c0aeec9e947c1df7755d
    Image:         824725208937.dkr.ecr.eu-central-1.amazonaws.com/linkerd2-proxy:v2.132.0-no-authority
    Image ID:      docker-pullable://824725208937.dkr.ecr.eu-central-1.amazonaws.com/linkerd2-proxy@sha256:50d010db0648bfcad10c32f859497978ac1bf8e00d843c279e5c18c9b9962c16
    Ports:         4143/TCP, 4191/TCP
    Host Ports:    0/TCP, 0/TCP
    State:         Running
      Started:     Wed, 17 Feb 2021 12:23:06 +0100
    Last State:    Terminated
      Reason:      OOMKilled
      Message:     time="2021-02-17T11:22:47Z" level=info msg="running version edge-21.1.2"
time="2021-02-17T11:22:47Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2021-02-17T11:22:47Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[     0.001847s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.002386s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.002403s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.002405s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.002407s]  INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[     0.002409s]  INFO ThreadId(01) linkerd2_proxy: Local identity is default.stage.serviceaccount.identity.linkerd.cluster.local
[     0.002412s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.002414s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.002567s]  INFO ThreadId(01) outbound: linkerd_app: listen.addr=127.0.0.1:4140 ingress_mode=false
[     0.002695s]  INFO ThreadId(01) inbound: linkerd_app: listen.addr=0.0.0.0:4143
[     0.018728s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.stage.serviceaccount.identity.linkerd.cluster.local
      Exit Code:    137
      Started:      Wed, 17 Feb 2021 12:22:47 +0100
      Finished:     Wed, 17 Feb 2021 12:22:51 +0100
    Ready:          True
    Restart Count:  2
    Limits:
      memory:  250Mi
    Requests:
      cpu:      100m
      memory:   250Mi
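
The proxy limit above is 250Mi. As a stopgap while the root cause is investigated, the limit can be raised per workload via Linkerd's proxy resource annotations; a minimal sketch, using placeholder names and assuming the standard config.linkerd.io/proxy-memory-* annotations are honoured by this injector version:

    # Raise the proxy memory limit on one workload by patching the pod-template
    # annotations (a JSON merge patch keeps the other annotations intact); the
    # change rolls the pods, so the injector re-injects the proxy with the new limit.
    kubectl -n my-namespace patch deployment my-deployment --type merge \
      -p '{"spec":{"template":{"metadata":{"annotations":{
            "config.linkerd.io/proxy-memory-limit":"500Mi",
            "config.linkerd.io/proxy-memory-request":"250Mi"}}}}}'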

linkerd check output

fp@sync > linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2021-04-02T16:06:37Z
    see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.9.2 but the latest stable version is 2.9.3
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.9.2 but the latest stable version is 2.9.3
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system

linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running

linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running

Environment

  • Kubernetes Version: v1.18.9
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd version: 2.9.2
  • Linkerd proxy version: edge-21.1.2

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (16 by maintainers)

Most upvoted comments

@shahriak I’d recommend using stable-2.10.1, since that is the more recent stable release and it fixed some issues we found soon after releasing 2.10.

As for this specific issue, we had a few ideas about what it was, but I don’t think we could ever say exactly what fixed it. That said, we haven’t seen many other OOM issues recently, so I think you should be okay upgrading. If you do run into an issue like this again, please open a new issue with a fresh description and we’ll look into it ASAP.
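
For reference, a sketch of the usual CLI-driven upgrade path to stable-2.10.1 (assumes cluster-admin access; meshed workloads only pick up the new proxy after they are restarted):

    # Update the CLI, upgrade the control plane, then verify.
    curl -sL https://run.linkerd.io/install | LINKERD2_VERSION=stable-2.10.1 sh
    linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -
    linkerd check
    # Roll meshed workloads so their proxies are re-injected at the new version,
    # e.g. per namespace (placeholder name):
    kubectl -n my-namespace rollout restart deploy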

You were right: we have a Twistlock pod (container security software) running with that IP address, and it was probably what was trying to connect to port 4143.
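
For the record, a sketch of how such a peer address can be mapped back to a pod (placeholder address; assumes status.podIP is selectable on this cluster version, otherwise the grep fallback works):

    # Map a peer IP from the proxy logs back to the pod that owns it
    # (10.0.0.42 is a placeholder for the address in question).
    kubectl get pods --all-namespaces -o wide --field-selector status.podIP=10.0.0.42
    # Fallback: plain text search over the wide listing.
    kubectl get pods --all-namespaces -o wide | grep 10.0.0.42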

Regarding the OOMs, we can try out the new edge proxy version in the next week or so, and I’ll report back if we keep seeing the issue. Thanks for the fast turnaround on this one.
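
To limit the blast radius, the new proxy build could first be pinned on a single workload via the per-pod proxy image/version annotations; a sketch with placeholder values (since we run a custom ECR image, both the image and the version would likely need overriding):

    # Pin one workload to a specific proxy image and tag via pod-template
    # annotations; both values below are placeholders for the suggested build.
    kubectl -n my-namespace patch deployment my-deployment --type merge \
      -p '{"spec":{"template":{"metadata":{"annotations":{
            "config.linkerd.io/proxy-image":"<proxy-image-repo>",
            "config.linkerd.io/proxy-version":"<suggested-edge-tag>"}}}}}'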

We have a new finding from today’s restarts: we consistently observe a spike in open file descriptors before the OOM event happens. The proxy in a linkerd-sp-validator pod restarted at exactly 07:38:55, which is also when the file descriptor count rapidly increased:

[screenshot: open file descriptor count for the linkerd-sp-validator proxy, spiking at 07:38:55]
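
A minimal sketch for sampling the proxy's file-descriptor and memory gauges at a finer interval than our regular scrape, assuming the proxy's admin endpoint on :4191 exposes the standard process_open_fds / process_resident_memory_bytes gauges (deployment name taken from the pod above):

    # Sample the proxy's fd and memory gauges once per second via the admin port.
    kubectl -n linkerd port-forward deploy/linkerd-sp-validator 4191:4191 &
    sleep 2  # give the port-forward a moment to establish
    while sleep 1; do
      date +%T
      curl -s localhost:4191/metrics \
        | grep -E '^process_(open_fds|max_fds|resident_memory_bytes)'
    done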

kubectl describe output for the restarted pod:

  linkerd-proxy:
    Container ID:  docker://fd87216f2630c7047e9769b1120500b1b200e071998ebe0a45130b0757ca9d33
    Image:         824725208937.dkr.ecr.eu-central-1.amazonaws.com/linkerd2-proxy:v2.132.0-no-authority
    Image ID:      docker-pullable://824725208937.dkr.ecr.eu-central-1.amazonaws.com/linkerd2-proxy@sha256:50d010db0648bfcad10c32f859497978ac1bf8e00d843c279e5c18c9b9962c16
    Ports:         4143/TCP, 4191/TCP
    Host Ports:    0/TCP, 0/TCP
    State:         Running
      Started:     Thu, 18 Feb 2021 07:39:06 +0100
    Last State:    Terminated
      Reason:      OOMKilled
      Message:     n closed error=connection closed before message completed
[    19.886690s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:49765 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.891020s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:49809 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.891391s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:49851 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.895871s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:49897 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.896320s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:49941 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.900545s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:49987 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.900892s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:50033 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.905390s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:50081 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.905722s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:50131 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
[    19.910100s]  INFO ThreadId(01) outbound:accept{peer.addr=127.0.0.1:50189 target.addr=127.0.0.1:4140}: linkerd_app_core::serve: Connection closed error=connection closed before message completed

      Exit Code:    137
      Started:      Thu, 18 Feb 2021 07:37:54 +0100
      Finished:     Thu, 18 Feb 2021 07:38:55 +0100
    Ready:          True
    Restart Count:  2
    Limits:
      memory:  250Mi
    Requests:
      cpu:      100m
      memory:   250Mi