opentelemetry-go: Metrics memory leak v1.12.0/v0.35.0 and up

Description

After upgrading from v1.11.2/v0.34.0 to v1.12.0/v0.35.0, I witnessed a never-ending increase in our pods' memory consumption over time. Reverting to the former version brought memory use back to a stable level. Once v1.13.0/v0.36.0 had been released I tried upgrading to that version, but got the same result. Once again, reverting to v1.11.2/v0.34.0 stopped the memory increase.

Refer to the screenshot below, which shows the memory use of one of our pods on AWS EKS. Specifically:

  • On 1/31/23 I upgraded to v1.12.0/v0.35.0, and memory use immediately started to grow.
  • On 2/6/23 I made an unrelated release, which reset memory use, but it immediately started growing again.
  • On 2/8/23 I downgraded to v1.11.2/v0.34.0, after which memory use stayed stable again.
  • On 2/14/23 I upgraded to v1.13.0/v0.36.0 and once again saw memory increase. A few new releases made afterwards temporarily reset memory use, as before.
  • On 2/21/23 I downgraded to v1.11.2/v0.34.0, restoring stable memory use.

[Grafana screenshot: pod memory use over the period described above]

Environment

  • OS: Linux
  • Architecture: AWS EKS
  • Go Version: 1.19
  • opentelemetry-go version: v1.12.0/v0.35.0 and up

Steps To Reproduce

  1. Upgrade to v1.12.0/v0.35.0 or higher.
  2. Update metric instruments from the syncint64 (and related) packages to their corresponding instrument-package equivalents (see the sketch after this list).
  3. See memory grow over time.
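
A minimal sketch of the kind of instrument migration step 2 refers to. This is illustrative only, not the reporter's actual code; the exact option and return types shifted between v0.35.0 and later releases, so treat the signatures as approximate.

package telemetry

import (
  "context"

  "go.opentelemetry.io/otel/metric/global"
)

func recordRequest(ctx context.Context) error {
  meter := global.MeterProvider().Meter("example/service")

  // Before (v1.11.2/v0.34.0 era), instruments were created through per-kind
  // providers, roughly:
  //   counter, err := meter.SyncInt64().Counter("app.requests")
  //
  // After the upgrade, the same counter is created directly from the Meter:
  counter, err := meter.Int64Counter("app.requests")
  if err != nil {
    return err
  }
  counter.Add(ctx, 1)
  return nil
}

The error from instrument creation is worth checking, since registration can fail for invalid or conflicting instrument definitions.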

Expected behavior

Memory stays stable.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 4
  • Comments: 20 (9 by maintainers)

Most upvoted comments

I think we’re getting close to the solution. Curling the /metrics endpoint gave me the “http_server” count and histogram metrics repeated thousands upon thousands of times (until I quit the command). See the attached text file, where I grabbed the first 1000 records.

The main difference between each repeated block of the “http_server” metrics is in the attributes: the net_sock_peer_port value keeps increasing. These metrics are not created by my app; they seem to be built in.

metrics.txt
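
For context on where those built-in “http_server” series come from: they are normally emitted by HTTP server instrumentation rather than by application code. A minimal sketch, assuming the middleware in use is otelhttp (the comment does not say which instrumentation actually produced the metrics):

package telemetry

import (
  "net/http"

  "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func newServerHandler(mux *http.ServeMux) http.Handler {
  // otelhttp records http.server.* metrics for every request handled by the
  // wrapped handler, using the globally registered MeterProvider. Attributes
  // such as net.sock.peer.port are attached by the instrumentation, not by
  // the application itself.
  return otelhttp.NewHandler(mux, "server")
}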

The following extract shows how we set up the exporter and meter provider:

// Create the metric exporter.
logger.Info("Creating metric exporter.")
metricExporter, err := prometheus.New()
if err != nil {
  logger.Panics(err)
}
// Create a view that filters out a number of dynamic attributes.
filteredView := metric.NewView(
  metric.Instrument{Name: "http.server.*"},
  metric.Stream{
    AttributeFilter: HttpAttributeFilter,
  },
)
// Create a metric provider using the exporter and filtered view.
metricProvider := metric.NewMeterProvider(
  metric.WithReader(metricExporter),
  metric.WithView(filteredView),
)
// Register the meter provider, so elsewhere we can call `global.MeterProvider()` to access it.
global.SetMeterProvider(metricProvider)
// Wrap the registry by a handler, so that collected metrics can be exported via HTTP.
metricHandler := promhttp.Handler()

The new addition we made, based on @MrAlias’ suggestion, is adding that filtered view, which has worked like a charm.
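
HttpAttributeFilter itself is not shown in the extract above. A minimal sketch of what such a filter could look like, assuming it is an attribute.Filter that drops the per-connection socket attributes (exactly which keys are worth dropping depends on the instrumentation):

package telemetry

import "go.opentelemetry.io/otel/attribute"

// HttpAttributeFilter reports whether an attribute should be kept on the
// http.server.* metric streams. Dropping per-connection values such as
// net.sock.peer.port prevents every client connection from becoming a new,
// never-deleted time series.
var HttpAttributeFilter attribute.Filter = func(kv attribute.KeyValue) bool {
  switch kv.Key {
  case "net.sock.peer.addr", "net.sock.peer.port":
    return false // drop high-cardinality, per-connection attributes
  default:
    return true // keep everything else
  }
}

Later releases of the attribute package also provide attribute.NewAllowKeysFilter, which expresses the same idea as an allow-list of keys rather than a deny-list.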

This looks related to https://github.com/open-telemetry/opentelemetry-go/issues/3744

The peer port should not be on a server metric.

If you could get data with -inuse_objects as well that could be helpful.
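
For anyone trying to gather the same data: the heap profile can be broken down by -inuse_objects as well as -inuse_space once the service exposes pprof. A minimal sketch, assuming the application does not already register the net/http/pprof handlers (the address and port are arbitrary):

package main

import (
  "log"
  "net/http"
  _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
  // Serve the pprof endpoints on a separate, non-public port. Profiles can
  // then be fetched by sample type, for example:
  //   go tool pprof -inuse_objects http://localhost:6060/debug/pprof/heap
  //   go tool pprof -inuse_space   http://localhost:6060/debug/pprof/heap
  log.Println(http.ListenAndServe("localhost:6060", nil))
}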

I deployed a fresh instance and will let it run until Monday so I can collect a new sample to give you even more to work with. If there are other pprof extracts you’re interested in, let me know and I’ll see what I can do.

Over the course of some 6 hours I took 5 heap dumps using pprof; see the attachment. In those 6 hours only the liveness and health checks were called, every 5 seconds, which resulted in ~4 counters and histograms being updated with constant attributes (namely, a string containing the suffix of the URL, either /livez or /healthz, and a boolean indicating whether the call was a success, which it always was).

It seems there are some functions worth looking at: prometheus.MakeLabelPairs and attribute.computeDistinctFixed.

heap.zip