istio: Telemetry v2 envoy needs metric expiry

Bug description: Mixer in telemetry v1 had the concept of metric expiry; from memory it was 10 minutes (any series not seen in 10 minutes was dropped).

That automatically cleaned up series for things that no longer exist. We have a cluster continually running tests, with workloads that correspond to those tests.

In v2, these series never get cleaned up, and Envoys keep references to them until those Envoys are restarted:

[Screenshot 2020-02-22 08:49:04]

cc. @douglas-reid @mandarjog

Expected behavior: Metrics should expire as they did in v1.

Steps to reproduce the bug

Version: 1.4.5

How was Istio installed?

Environment where bug was observed (cloud vendor, OS, etc)

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 20
  • Comments: 30 (25 by maintainers)

Most upvoted comments

Hmm. This would be a very useful feature indeed.

What kind of work is needed here? Does this involve some work on Envoy? Or only on the stats filter?

Any updates? We are definitely having the same issue, and it would be great to have metric expiry in v2.

Hi @douglas-reid @mandarjog @howardjohn, just curious whether this is still a priority, since it is a problem for use cases where people want to run this in production: over a period of time, Prometheus scraping metrics from the Envoy pods won't scale anymore.

Here's the Envoy-side issue for tracking this: https://github.com/envoyproxy/envoy/issues/14070

What kind of work is needed here? Does this involve some work on Envoy? Or only on the stats filter?

Yes, this needs support in the Envoy stats system.

Moving to 1.12 due to recent inactivity

I believe this issue is causing our ingress gateways to be OOMKilled. We are seeing a linear increase in both memory usage and transmit bandwidth on our ingress gateways over a roughly 2-month period, resulting in an OOMKill. This is a similar pattern to the one shown in https://github.com/istio/istio/issues/24058. Curling the /stats/prometheus endpoint on one of our ingress gateways showed metrics for literally every workload that has run in the cluster since the gateway was created. Here is the pattern we are seeing:

[Screenshot 2023-01-09 1:53:15 PM] [Screenshot 2023-01-05 2:36:08 PM]

Is there any progress on solving this issue?