opentelemetry-js: Potential memory leak with the Prometheus metrics exporter

What happened?

Steps to Reproduce

We’ve instrumented some metrics via the OTel metrics SDK and are exposing them with the Prometheus exporter.

Expected Result

Prometheus can scrape this service without its memory continuously increasing.

Actual Result

We are seeing a very steady increase in memory usage until it hits the heap limit, at which point the service terminates and restarts.

[image: memory usage graph showing the steady increase until restart]

Additional Details

OpenTelemetry Setup Code

import express, { Request, Response, NextFunction } from 'express';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { Counter as OtelCounter, Histogram as OtelHistogram, metrics } from '@opentelemetry/api';

const prometheusExporter = new PrometheusExporter({
  preventServerStart: true
});

// MeterProvider comes from @opentelemetry/sdk-metrics (listed in the dependencies below).
const meterProvider = new MeterProvider();

meterProvider.addMetricReader(prometheusExporter);

metrics.setGlobalMeterProvider(meterProvider);

const meter = metrics.getMeter('controller');

export const fooCounter1: OtelCounter = meter.createCounter('foo_counter_1', {
  description: 'foo'
});

export const fooCounter2: OtelCounter = meter.createCounter('foo_counter_2', {
  description: 'foo.'
});

export const fooHistogram1: OtelHistogram = meter.createHistogram('foo_histogram', {
  description: 'foo'
});

export const getMetrics = (req: Request, res: Response, next: NextFunction) => {
  try {
    // getMetricsRequestHandler writes the status code and the serialized
    // metrics to the response itself, so nothing should be sent beforehand.
    prometheusExporter.getMetricsRequestHandler(req, res);
  } catch (err) {
    next(err);
  }
};

const router = express.Router();

router.get('/metrics', getMetrics);

export default router;

package.json

{
  "dependencies": {
    "@opentelemetry/api": "1.3.0",
    "@opentelemetry/core": "1.8.0",
    "@opentelemetry/exporter-jaeger": "1.7.0",
    "@opentelemetry/exporter-metrics-otlp-http": "0.34.0",
    "@opentelemetry/exporter-prometheus": "0.34.0",
    "@opentelemetry/exporter-trace-otlp-http": "0.32.0",
    "@opentelemetry/propagator-jaeger": "1.7.0",
    "@opentelemetry/resources": "1.6.0",
    "@opentelemetry/sdk-metrics": "1.8.0",
    "@opentelemetry/sdk-trace-base": "1.6.0",
    "@opentelemetry/sdk-trace-node": "1.6.0",
    "@opentelemetry/semantic-conventions": "1.6.0",
    ...
  }
}

Relevant log output

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 18 (8 by maintainers)

Most upvoted comments

Thanks for the additional info @cyberw, I’ll try to reproduce.

I’m closing this issue as cannot reproduce (single Prometheus exporter setup). If the same problem persists, please feel free to open a new issue and link this one.

Is this fixed by #4163?

Unfortunately, no. It should solve the problem you had (#4115), but I was never able to reproduce this exact problem reported by @bruce-y (single Prometheus exporter / metric reader). The fix from #4163 strictly addresses problems with 2+ metric readers.

Any help with reproducing this issue here would be greatly appreciated.

@cyberw I was able to create a minimal reproducer for the problem you reported over at #4115; shortening the export interval makes it very easy to see that memory is leaking. Since the OP’s reported setup only has one exporter, I won’t close #4115 as a duplicate, as they might be separate issues.
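
For context, here is a rough sketch of the “shorten the export interval” approach with a periodic reader. It is not the reproducer from #4115, the instrument name and intervals are made up, and whether memory actually grows depends on the exact setup being investigated here.

import {
  MeterProvider,
  PeriodicExportingMetricReader,
  ConsoleMetricExporter,
} from '@opentelemetry/sdk-metrics';

// A short export interval makes any per-export accumulation visible within
// minutes instead of days.
const meterProvider = new MeterProvider();
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new ConsoleMetricExporter(),
    exportIntervalMillis: 100,
  })
);

const histogram = meterProvider.getMeter('repro').createHistogram('repro_histogram');

// Keep recording so every export cycle has data, and watch
// process.memoryUsage().heapUsed over time.
setInterval(() => histogram.record(Math.random() * 1000), 10);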

Apologies for the late reply.

Could you provide some more info on the metrics that you’re recording? A known configuration that can cause memory leaks is attributes with many possible values (for instance, a timestamp or a user ID); see https://github.com/open-telemetry/opentelemetry-js/issues/2997. Could this be what’s causing your issue?
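
To illustrate the attribute-cardinality problem mentioned above, a small hypothetical example (the counter name and attributes are not from this report):

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('example');
const requestCounter = meter.createCounter('http_requests_total');

// Anti-pattern: every distinct userId / timestamp value creates a new
// attribute set, so the SDK keeps a separate accumulation (and the exporter
// a separate time series) for each one, growing without bound.
requestCounter.add(1, { userId: 'user-12345', ts: Date.now().toString() });

// Bounded alternative: attribute values drawn from a small, fixed set.
requestCounter.add(1, { route: '/metrics', method: 'GET', status: '200' });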

We aren’t dealing with super high cardinality for any single metric, but we are exporting many histograms that each contain a large number of buckets. I would say it’s likely around 80k lines of metrics when exported.

We’re exporting a similar volume via https://github.com/siimon/prom-client without issue, though.
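
Not a confirmed cause or fix for this issue, but if the concern is many histograms with many buckets, one way to bound the buckets per histogram in @opentelemetry/sdk-metrics is a View with explicit boundaries; the boundaries below are purely illustrative:

import { MeterProvider, View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics';

// Each matching histogram then contributes a small, fixed number of bucket
// series regardless of the values recorded into it.
const meterProvider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'foo_histogram',
      aggregation: new ExplicitBucketHistogramAggregation([5, 10, 25, 50, 100, 250, 500, 1000]),
    }),
  ],
});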

@bruce-y what’s your export interval?

The scrape interval we have set locally is every 30 seconds.

Also, are you deduping metrics with the same name? One of the first problems I ran into with this ecosystem was treating stats like StatsD, where you can keep throwing the same name at the API over and over with no consequences. In OpenTelemetry, each counter, histogram, and gauge ends up accumulating both memory and CPU time. I had to introduce a lookup table (and be careful with names vs. attributes to keep the table from growing unbounded).
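
A minimal sketch of the lookup-table approach described above; the getCounter helper and metric names are illustrative, not taken from the reporter’s code:

import { Counter, metrics } from '@opentelemetry/api';

// Assumes a global meter provider has already been registered, as in the
// setup code above.
const meter = metrics.getMeter('controller');

// Reuse one instrument per name instead of creating a new Counter each time.
const counters = new Map<string, Counter>();

export function getCounter(name: string): Counter {
  let counter = counters.get(name);
  if (!counter) {
    counter = meter.createCounter(name);
    counters.set(name, counter);
  }
  return counter;
}

// Keep the set of names bounded; dynamic values belong in attributes,
// not in the metric name, or the table itself grows without bound.
getCounter('jobs_processed_total').add(1, { queue: 'default' });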

We export the metric references as singletons in our application, so any time something is observed we use the existing metric object rather than instantiating a new one with the same name.

I’m currently tracking ever-increasing CPU usage by aws-otel-collector while using the OTLP exporter.

So I’m wondering if this is two bugs, the same bug in two implementations, or a continuous accumulation of data that isn’t being cleared during an export (and/or is being sent multiple times). Spiralling data would result in increasing memory, and spiralling data sets would result in increasing CPU.