istio: Memory leak in the Istio sidecar proxy

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

Our Istio sidecar proxies are consistently leaking memory over the span of a few days (see the chart below of container_memory_working_set_bytes).


We are seeing a slow but constant increase in memory on high-traffic pods. Sidecar memory goes from 100-200 MB to 1-2 GB over the course of a few days and eventually results in the pod getting OOM-killed.

Below is a chart showing the memory consumption for one of the affected pods. The pod started on 25 Oct at 14% (of 1 GB) memory consumption and today sits at ~60%. In a few more days the pod will be killed once memory crosses the configured limit of 1 GB. We have other cases where memory increases much faster; the rate of increase appears to be proportional to the traffic.

[charts: sidecar memory consumption for affected pods]

Version

❯ istioctl version
client version: 1.18.2
control plane version: 1.18.5
data plane version: 1.18.2 (11 proxies), 1.18.3 (18 proxies), 1.18.5 (369 proxies)

Additional Information

I reviewed the output of istioctl bug-report and am not sure whether I can share it publicly. Happy to email/Slack it over to a maintainer.

About this issue

  • Original URL
  • State: open
  • Created 7 months ago
  • Comments: 24 (11 by maintainers)

Most upvoted comments

I further analyzed the metric dimensions, and although I did not see any obvious ones with potentially high cardinality, I tried removing everything we did not need in our observability dashboards, and things have been pretty good since then. Here are the things I tried.

  1. [Nov 23] Dropped the destination_service and request_host dimensions from sidecars, as they had obvious high-cardinality issues for some services. Initially this seemed to fix everything, but it soon became obvious that memory was still growing for most services, though some much more slowly than before.
  2. [Nov 29] We had not configured all metrics and were missing some (request_messages, response_bytes, etc.). I added them to our custom metric configuration as well and dropped all the dimensions we were already dropping for the other metrics.
  3. [Dec 1] Dropped some more dimensions that were not exactly high cardinality but had a lot of values (destination_port), or that we didn't need in our dashboards (connection_security_policy); only dropped these from some metrics. As expected, this did not seem to have a considerable impact.
  4. [Dec 3] Identified and dropped more dimensions that we did not strictly need, to reduce the number of time series generated for histogram buckets (response_code, response_flags). Also added disable_host_header_fallback: true, even though, if I understand correctly, this should not have had any effect since we were already dropping the request_host dimension everywhere. Observed this for about 16 hours; memory was still gradually increasing.
  5. [Dec 4] Dropped destination_port from all metrics and cleaned up some custom metric configuration [1] where we map/transform dimension values. A lot of those dimensions were no longer in use, so dropping the custom config was not an issue; I'm not sure we even needed it in the first place. Since this patch, we have not seen any increase in memory consumption at all for any service in any environment. I tried a couple of times to isolate the cause but could not narrow it down to any single config change. Note that it takes at least a day or two to notice the leak, so trying all combinations is quite time-consuming. I'd still like to try to isolate it in the next few weeks as time allows.
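For anyone wanting to replicate the dimension drops above: this kind of change can be expressed declaratively with the Telemetry API's tagOverrides. A sketch of a mesh-wide resource follows; the resource name is made up, and the exact dimension list is illustrative — match it to what your dashboards actually need:

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-heavy-dimensions  # illustrative name
  namespace: istio-system      # applying in the root namespace makes it mesh-wide
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
        mode: CLIENT_AND_SERVER
      tagOverrides:
        # Remove high-cardinality tags from all standard metrics.
        destination_service:
          operation: REMOVE
        request_host:
          operation: REMOVE
        destination_port:
          operation: REMOVE
```

Each removed tag shrinks the number of distinct time series the stats plugin has to keep in proxy memory, which is why high-cardinality tags like request_host tend to dominate.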

Above timeline as a chart:

[chart: memory consumption across the timeline above]

Our telemetry config before and after

Note that even though our telemetry config is almost the same for gateways and sidecars (inbound/outbound), we never experienced the issue with the gateways.


Hope this helps others who've run into similar issues and, more importantly, helps the Istio team identify the root cause and perhaps patch it and/or document it somewhere as a best practice/FAQ.


1: custom metric config example: [screenshot omitted]

Update: I disabled all telemetry, and within 10 minutes memory consumption dropped from 40-80% to 10-20% on all pods. It has remained flat and has not increased in the last ~12 hours, which is very promising. Will keep monitoring it for a couple of days and then try to close in on which metrics/dimensions are causing the "leak".
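For reference, a sketch of what "disable all telemetry" can look like via the Telemetry API, assuming the standard prometheus provider (the resource name is made up):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: disable-all-stats  # illustrative name
  namespace: istio-system  # mesh-wide scope
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    # Stop the proxies from recording any standard metrics.
    - match:
        metric: ALL_METRICS
      disabled: true
```

This stops the sidecars from recording standard metrics entirely, which is a useful bisection step for confirming telemetry is the source of the growth, but obviously not a long-term fix.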