istio: Memory leak in the Istio sidecar proxy
Is this the right place to submit this?
- This is not a security vulnerability or a crashing bug
- This is not a question about how to use Istio
Bug Description
We have been running Istio sidecar proxies that are consistently leaking memory over the span of a few days (as shown in the chart that uses container_memory_working_set_bytes).
We are seeing a slow but constant increase in memory on high-traffic pods. Sidecar memory grows from 100-200MB to 1-2GB over the course of a few days and eventually results in the pod getting OOM killed.
Below is a chart showing the memory consumption for one of the affected pods. The pod started on 25th Oct at 14% (of 1GB) memory consumption and today sits at ~60%. In a few more days, the pod will get killed once memory crosses the configured limit of 1GB. We have other cases where memory increases much faster. It looks like the rate of increase is proportional to the traffic.
Version
❯ istioctl version
client version: 1.18.2
control plane version: 1.18.5
data plane version: 1.18.2 (11 proxies), 1.18.3 (18 proxies), 1.18.5 (369 proxies)
Additional Information
I reviewed the output of istioctl bug-report, but I'm not sure if I can share it publicly. Happy to email/Slack it over to a maintainer.
About this issue
- Original URL
- State: open
- Created 7 months ago
- Comments: 24 (11 by maintainers)
I further analyzed the metric dimensions and, although I did not see any obvious ones with potential high cardinality, I tried to remove everything that we did not need in our observability dashboards, and things have been pretty good since then. Here are the things I tried.
- Dropped a couple of dimensions (`response_code`, `response_flags`). Also added `disable_host_header_fallback: true`, even though, if I understand correctly, this should not have had any effect, as we had already dropped the `request_host` dimension everywhere. Observed this for about 16 hours and memory was still gradually increasing.
- Removed `destination_port` from all metrics and cleaned up some custom metric configuration [1] where we map/transform dimension values. A lot of these dimensions were not in use anymore, so dropping the custom config was not an issue. I'm not sure we even needed it in the first place. Since this patch, we've not seen any increase in memory consumption at all for any service in any environment.

I tried to isolate it a couple of times but did not narrow it down to any single config. Note that it takes at least a day or two to notice the leak, so trying all combinations is quite time consuming. I'd still like to try and isolate it in the next few weeks as time allows.

Above timeline as a chart:
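For anyone wanting to try the same mitigation: dimensions can be dropped with the Telemetry API's `tagOverrides`. A minimal sketch of what we applied (the resource name is illustrative, and this is not our exact config):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-metric-dimensions   # illustrative name
  namespace: istio-system        # mesh-wide when applied to the root namespace
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
      tagOverrides:
        # remove the dimensions we suspected of contributing to growth
        destination_port:
          operation: REMOVE
        response_flags:
          operation: REMOVE
```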
Our telemetry config before and after
Note that even though our telemetry config is almost the same for gateways and sidecars (inbound/outbound), we never experienced the issue with the gateways.
Hope this helps others who've run into similar issues and, more importantly, helps the Istio team identify the root cause and perhaps patch it and/or document it somewhere as a best practice/FAQ.
1: custom metric config example:
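(Our exact config isn't reproduced here, but the kind of dimension mapping/transformation described above is done via a `tagOverrides` `UPSERT` with a CEL expression. The dimension name and expression below are made up for illustration only:)

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-dimensions   # illustrative name
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_COUNT
      tagOverrides:
        # add/replace a dimension, computed from a request attribute (CEL)
        request_path:                    # hypothetical dimension name
          operation: UPSERT
          value: "request.url_path"      # hypothetical mapping expression
```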
Update: I disabled all telemetry, and within 10 minutes memory consumption dropped from 40-80% to 10-20% on all pods. It has remained flat and has not increased in the last 12 hours or so, which is very promising. I'll keep monitoring it for a couple of days and then try to close in on which metrics/dimensions are causing the "leak".
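For reference, "disabled all telemetry" here means turning off all metrics for the prometheus provider mesh-wide with a Telemetry resource along these lines (illustrative sketch, not our exact config):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: disable-all-metrics   # illustrative name
  namespace: istio-system     # root namespace => applies mesh-wide
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
      disabled: true          # stop reporting every standard metric
```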