linkerd2: Linkerd proxy killed with OOM
Bug Report
What is the issue?
The linkerd proxy is getting OOM-killed on a regular basis. However, memory usage before the proxy is killed is stable at ~6MB; we see neither memory creep nor spikes. Our suspicion is that the proxy tries to allocate a large amount of memory at once, or over a very short period of time, so the spike never shows up in our monitoring system.
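Since a coarse scrape interval could simply miss a short-lived burst, one thing to try is sampling the proxy's own memory gauge at sub-second intervals through its admin port. A rough sketch, assuming the proxy exports the usual process_resident_memory_bytes gauge on :4191/metrics (pod name and namespace are placeholders):

# Forward the proxy admin port of a suspect pod (names are placeholders)
kubectl -n stage port-forward pod/my-app-6d4b75cb6d-xxxxx 4191:4191 &

# Sample resident memory several times per second and log it with a timestamp
while true; do
  printf '%s ' "$(date +%T.%3N)"
  curl -s localhost:4191/metrics | grep '^process_resident_memory_bytes'
  sleep 0.2
done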
How can it be reproduced?
We are unable to reproduce this issue in a controlled manner, but it happens sporadically in all of our environments.
Logs, error output, etc
Here is the output of kubectl describe showing the status of the linkerd-proxy container:
Container ID: docker://1ad8fc58b8e58274586048c433063bdfbe3f0ee05c59c0aeec9e947c1df7755d
Image: 824725208937.dkr.ecr.eu-central-1.amazonaws.com/linkerd2-proxy:v2.132.0-no-authority
Image ID: docker-pullable://824725208937.dkr.ecr.eu-central-1.amazonaws.com/linkerd2-proxy@sha256:50d010db0648bfcad10c32f859497978ac1bf8e00d843c279e5c18c9b9962c16
Ports: 4143/TCP, 4191/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Wed, 17 Feb 2021 12:23:06 +0100
Last State: Terminated
Reason: OOMKilled
Message: time="2021-02-17T11:22:47Z" level=info msg="running version edge-21.1.2"
time="2021-02-17T11:22:47Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2021-02-17T11:22:47Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.001847s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.002386s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.002403s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.002405s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.002407s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.002409s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.stage.serviceaccount.identity.linkerd.cluster.local
[ 0.002412s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.002414s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.002567s] INFO ThreadId(01) outbound: linkerd_app: listen.addr=127.0.0.1:4140 ingress_mode=false
[ 0.002695s] INFO ThreadId(01) inbound: linkerd_app: listen.addr=0.0.0.0:4143
[ 0.018728s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.stage.serviceaccount.identity.linkerd.cluster.local
Exit Code: 137
Started: Wed, 17 Feb 2021 12:22:47 +0100
Finished: Wed, 17 Feb 2021 12:22:51 +0100
Ready: True
Restart Count: 2
Limits:
memory: 250Mi
Requests:
cpu: 100m
memory: 250Mi
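One stopgap while debugging is to raise the proxy's memory request/limit using Linkerd's config.linkerd.io annotations, which the injector reads from the namespace or the pod template. A sketch, assuming the affected workloads live in the stage namespace (the limit values are illustrative, and pods must be restarted so the injector re-renders the sidecar):

# Raise the proxy memory settings for all injected workloads in the namespace
kubectl annotate namespace stage \
  config.linkerd.io/proxy-memory-request=250Mi \
  config.linkerd.io/proxy-memory-limit=512Mi

# Restart a workload so the new limits are picked up (name is a placeholder)
kubectl -n stage rollout restart deployment my-app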
linkerd check output
fp@sync > linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2021-04-02T16:06:37Z
see https://linkerd.io/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.9.2 but the latest stable version is 2.9.3
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.9.2 but the latest stable version is 2.9.3
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
linkerd-prometheus
------------------
√ prometheus add-on service account exists
√ prometheus add-on config map exists
√ prometheus pod is running
linkerd-grafana
---------------
√ grafana add-on service account exists
√ grafana add-on config map exists
√ grafana pod is running
Environment
- Kubernetes Version: v1.18.9
- Cluster Environment: EKS
- Host OS: Amazon Linux 2
- Linkerd version: 2.9.2
- Linkerd proxy version: edge-21.1.2
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 19 (16 by maintainers)
Commits related to this issue
- outbound: Prevent connections on the loopback interface linkerd/linkerd2#5764 reports proxy OOMs, apparently caused by traffic looping from the outbound proxy back to the inbound proxy. While we don'... — committed to linkerd/linkerd2-proxy by olix0r 3 years ago
- outbound: Prevent connections on the loopback interface (#924) linkerd/linkerd2#5764 reports proxy OOMs, apparently caused by traffic looping from the outbound proxy back to the inbound proxy. While... — committed to linkerd/linkerd2-proxy by olix0r 3 years ago
@shahriak I’d recommend using stable 2.10.1, since that is the most recent stable release and it fixed some issues we found soon after releasing 2.10.
As for this specific issue, we had a few ideas about the cause, but I don’t think we could ever say exactly what fixed it. That being said, we haven’t seen many other OOM issues recently, so I think you should be okay upgrading. If you do run into an issue like this, please open a new issue with a fresh description and we’ll look into it ASAP.
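The stable upgrade flow is roughly the following (a sketch; check the official upgrade guide for any version-specific steps):

linkerd version                       # confirm the CLI is already on the target version
linkerd upgrade | kubectl apply -f -  # re-render and apply the control plane manifests
linkerd check                         # verify the cluster after the upgrade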
You were right: we have a Twistlock pod (container security software) running with that IP address, and it was probably what was trying to connect to port 4143.
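One way to map a source IP like that back to the pod that owns it, sketched with a placeholder IP:

# Find the pod that owns a given IP (status.podIP is a supported pod field selector)
kubectl get pods --all-namespaces -o wide --field-selector status.podIP=10.0.0.42

# or, without field selectors:
kubectl get pods --all-namespaces -o wide | grep 10.0.0.42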
Regarding the OOMs, we can try the new edge proxy version in the next week or so, and I’ll report back on whether we continue to see the issue. Thanks for the fast turnaround on this one.
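To trial a newer proxy on a single workload without touching the rest of the mesh, Linkerd's per-workload proxy annotations can be used; a sketch, with the deployment name and version value purely illustrative:

# Pin a specific proxy version on one workload's pod template
kubectl -n stage patch deployment my-app --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-version":"edge-21.2.3"}}}}}'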
We have a new finding from today’s restarts: we consistently observe a spike in open file descriptors before the OOM event. The proxy in a linkerd-sp-validator pod restarted at exactly 07:38:55, which is also when the file descriptor count rose sharply:
[attached: kubectl describe output and logs]
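To confirm this from the proxy itself rather than from node-level monitoring, the admin endpoint can be polled for its file-descriptor gauges. A quick sketch, assuming the proxy exports process_open_fds / process_max_fds on :4191/metrics (the pod name is a placeholder):

kubectl -n linkerd port-forward pod/linkerd-sp-validator-xxxxxxxxxx-xxxxx 4191:4191 &
watch -n1 "curl -s localhost:4191/metrics | grep -E '^process_(open|max)_fds'"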