istio: Telemetry v2 envoy needs metric expiry

Bug description: Mixer in telemetry v1 had the concept of metric expiry; from memory it was 10 minutes (any series not seen in 10 minutes was dropped).

That automatically cleaned up series for things that no longer exist. We have a cluster continually running tests, with workloads that correspond to those tests.

In v2, these series never get cleaned up, and Envoys keep references to them until those Envoys are restarted:

[Screenshot 2020-02-22 08:49:04]

cc. @douglas-reid @mandarjog

Expected behavior: Metrics should expire as they did in v1.

Steps to reproduce the bug

Version: 1.4.5

How was Istio installed?

Environment where bug was observed (cloud vendor, OS, etc)

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 20
  • Comments: 30 (25 by maintainers)

Most upvoted comments

Hmm. This would be a very useful feature indeed.

What kind of work is needed here? Does this involve some work on Envoy? Or only on the stats filter?

Any updates? We are definitely having the same issue, and it would be great to have metric expiry in v2.

Hi @douglas-reid @mandarjog @howardjohn, just curious whether this is still a priority, since it is a problem for use cases where people want to run this in production: over a period of time, Prometheus scraping metrics from the Envoy pods won't scale anymore.

Here's the Envoy-side issue for tracking this: https://github.com/envoyproxy/envoy/issues/14070

What kind of work is needed here? Does this involve some work on Envoy? Or only on the stats filter?

Yes, this needs support in the Envoy stats system.

Moving to 1.12 due to recent inactivity

I believe this issue is causing our ingress gateways to be OOMKilled. We are seeing a linear increase in both memory usage and transmit bandwidth on our ingress gateways over a roughly 2-month period, resulting in an OOMKill. This is a similar pattern to the one shown in https://github.com/istio/istio/issues/24058. Curling the /stats/prometheus endpoint on one of our ingress gateways showed metrics for literally every workload that has run in the cluster since the gateway was created. Here is the pattern we are seeing:

[Screenshot 2023-01-09 1:53:15 PM] [Screenshot 2023-01-05 2:36:08 PM]

Is there any progress on solving this issue?