opentelemetry-go: Metrics memory leak v1.12.0/v0.35.0 and up
Description
After upgrading from v1.11.2/v0.34.0 to v1.12.0/v0.35.0, our pods showed a never-ending increase in memory consumption over time. Reverting to the former version brought memory use back to a stable level. Once v1.13.0/v0.36.0 was released I tried upgrading to that version, but got the same result. Once again, reverting to v1.11.2/v0.34.0 stopped the memory increase.
Refer to the screenshot below, which shows the memory use of one of our pods on AWS EKS:
- On 1/31/23 I upgraded to v1.12.0/v0.35.0, and memory immediately started to grow.
- On 2/6/23 I made an unrelated release, which reset memory use, but it immediately started growing again.
- On 2/8/23 I downgraded to v1.11.2/v0.34.0, after which memory use stayed stable again.
- On 2/14/23 I upgraded to v1.13.0/v0.36.0 and once again saw memory increase. A few new releases made afterwards temporarily reset memory use, as before.
- On 2/21/23 I downgraded to v1.11.2/v0.34.0, restoring stable memory use.
Environment
- OS: Linux
- Architecture: AWS EKS
- Go Version: 1.19
- opentelemetry-go version: v1.12.0/v0.35.0 and up
Steps To Reproduce
- Upgrade to v1.12.0/v0.35.0 or higher.
- Update metrics from syncint64 etc. to their corresponding instrument versions (see the sketch below).
- See memory grow over time.
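For reference, a minimal sketch of that instrument migration, assuming the v1.12.0/v0.35.0 metric API; the instrument name and attribute are illustrative, not taken from the affected service:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric/global"
	"go.opentelemetry.io/otel/metric/instrument"
)

func main() {
	meter := global.Meter("example")

	// v1.11.2/v0.34.0 style:
	//   counter, err := meter.SyncInt64().Counter("requests.total")
	// v1.12.0/v0.35.0 and later: instruments are created directly on the
	// Meter and configured with options from the instrument package.
	counter, err := meter.Int64Counter("requests.total",
		instrument.WithDescription("Total number of handled requests."),
	)
	if err != nil {
		panic(err)
	}

	counter.Add(context.Background(), 1, attribute.String("route", "/healthz"))
}
```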
Expected behavior
Memory stays stable.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 4
- Comments: 20 (9 by maintainers)
I think we’re getting close to the solution. Curling the /metrics endpoint gave me “http_server” count and histogram metrics repeated thousands upon thousands of times (until I quit the command). See the attached text file where I grabbed the first 1000 records. The main difference between each repeated block of the “http_server” metrics is in the attributes: the net_sock_peer_port keeps increasing. These metrics are not created by my app - they seem to be built-in.
metrics.txt
The following extract shows how we set up the exporter and meter provider:
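A minimal sketch of such a setup, assuming the v1.13.0/v0.36.0 prometheus exporter and sdk/metric APIs; the instrument criteria, attribute key, and port are illustrative rather than our exact code:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter doubles as a Reader for the MeterProvider and
	// registers its collector with the default Prometheus registerer.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}

	// View that drops the unbounded net.sock.peer.port attribute from the
	// built-in http.server.* instruments, so each client connection no
	// longer produces a new attribute set (and a new timeseries).
	dropPeerPort := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.*"},
		sdkmetric.Stream{
			AttributeFilter: func(kv attribute.KeyValue) bool {
				return kv.Key != "net.sock.peer.port"
			},
		},
	)

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(exporter),
		sdkmetric.WithView(dropPeerPort),
	)
	global.SetMeterProvider(provider)

	// Expose the collected metrics for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2222", nil))
}
```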
The new addition we made, based on @MrAlias’ suggestion, is adding that filtered view, which has worked like a charm.
This looks related to https://github.com/open-telemetry/opentelemetry-go/issues/3744
The peer port should not be on a server metric.
If you could get data with -inuse_objects as well, that could be helpful.
I deployed a fresh instance and will let it run until Monday so I can collect a new sample to give you even more to work with. If there are other pprof extracts you're interested in, let me know, and I'll see what I can do.
Over the course of some 6 hours I took 5 heap dumps using pprof; see the attachment. In those 6 hours only the liveness and health checks were called, every 5 seconds, which resulted in ~4 counters and ~histograms being updated with constant attributes (namely, a string containing the suffix of the URL - either /livez or /healthz - and a boolean indicating whether the call was a success, which it always was).
There are some functions worth looking at, it seems: prometheus.MakeLabelPairs and attribute.computeDistinctFixed.
heap.zip