go-agent: Memory leak in Go 1.17 (http.Transport)
Description
We started seeing a memory leak after we upgraded from Go 1.11.3 (no modules) to Go 1.17 (with Go modules). In the graph below, memory utilization can be seen growing right after the upgrade on Dec 10. We had been running the agent (github.com/newrelic/go-agent/v3 v3.15.2) on Go 1.11.3 without any issues.

This is a high-throughput service; the aggregated throughput of all our instances is:

When we ran pprof on one of the instances, we saw something similar to the comment in go/issues/43966:
flat flat% sum% cum cum%
1.50MB 5.07% 5.07% 12.04MB 40.70% net/http.(*Transport).dialConn
0 0% 5.07% 12.04MB 40.70% net/http.(*Transport).dialConnFor
8.03MB 27.14% 32.21% 8.03MB 27.14% bufio.NewWriterSize (inline)
6.02MB 20.36% 52.57% 6.02MB 20.36% bufio.NewReaderSize (inline)
The workaround seems to be to create a new http.Transport each time. Looking at the code, the problem could be that collectorDefaultTransport is a global variable and is therefore reused.
https://github.com/newrelic/go-agent/blob/master/v3/newrelic/transport.go#L18
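For illustration only, here is a minimal sketch of the two patterns being contrasted. The variable name collectorDefaultTransport matches the agent's code at the link above, but the settings shown are placeholders, and newCollectorClient is an illustrative name, not part of the agent.

```go
package collector // illustrative package, not the agent's actual code

import "net/http"

// Pattern used by the agent: a package-level transport that is reused for
// every request to the collector (placeholder settings only).
var collectorDefaultTransport http.RoundTripper = &http.Transport{
	// timeouts, TLS config, connection limits, ...
}

// Workaround discussed in go/issues/43966: build a fresh transport (and
// client) per use, so buffers held by a long-lived transport are not retained.
func newCollectorClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// same placeholder settings as above
		},
	}
}
```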
After removing the New Relic agent from our application, we confirmed that the memory leak is gone.
Steps to Reproduce
We are not able to reproduce this in our development environment due to the low traffic there, but in theory it could be reproduced in an application with a similar load running Go 1.17 and agent version github.com/newrelic/go-agent/v3 v3.15.2.
Expected Behavior
Memory utilization should remain constant
Your Environment
Running on Ubuntu:
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
Additional context
There’s a PR that potentially addresses this issue in Go: https://github.com/golang/go/pull/50799
But the other libraries we use do not use a global variable for http.Transport and therefore do not trigger the memory leak. I don't know whether that is the standard way to work with http.Transport.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 5
- Comments: 24 (15 by maintainers)
We’ve managed to mitigate the issue by overriding the transaction pointer with an empty struct:
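The snippet itself is not reproduced in this thread; a minimal sketch of what that mitigation could look like with go-agent v3 (where newrelic.Transaction is a struct, and doWork/"periodic-task" are illustrative names) is:

```go
package mitigation // illustrative

import "github.com/newrelic/go-agent/v3/newrelic"

// doWork sketches the described mitigation: zero out the transaction value
// after End(), so the internal references it holds become collectable even if
// the *Transaction pointer itself stays reachable (e.g. via a long-lived context).
func doWork(app *newrelic.Application) {
	txn := app.StartTransaction("periodic-task")
	// ... instrumented work ...
	txn.End()
	*txn = newrelic.Transaction{} // overwrite with an empty struct
}
```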
Results from a high-volume service:
Thank you for bringing this to our attention. We're looking into this and into what the best way to mitigate it would be until the memory leak in http.Transport is fixed.

@iamemilio We had a support case created with New Relic as well. Please let me know if we can have a working session. We can try reproducing this issue again.
@iamemilio I wanted to bring to your attention that we are using the latest Go version, 1.18, and the New Relic Go agent v3.17.0.
Memory Leak Root Cause Analysis for Datastore Segments
Hi all, thanks so much for working with us on this issue, and for being patient. We worked alongside one of the original reporters of the issue and were able to completely remediate it in their environment. We wanted to post a formal write-up here to explain what the problem was, how we fixed it, and what the next troubleshooting steps for all of you should be.
Root Cause
Here is a very simplified walk-through of what caused the issue; a pseudo-code sketch follows the numbered steps.
(1) A context is created to keep track of a flow of execution. This context is created at a very high code scope and exists outside the scope where transactions get created.
(2) This context gets passed into each function that executes a transaction with some business logic on a datastore.
(3) The doDataTransaction function creates a transaction, then calls NewContext to write the transaction pointer to a context. This function is intentionally named NewContext because it calls context.WithValue(), which creates a copy of the original context with an added key-value pair containing a pointer to the transaction object. By replacing the original context with this newly created context containing the transaction pointer, we are effectively passing the transaction up the stack to the mainLoop().
(4) The business logic is executed and the transaction is ended. This does not explicitly delete the transaction data on the heap, by design, in case the metadata needs to be accessed after the transaction ends.
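The original pseudo-code snippet is not reproduced in this thread; the following minimal sketch, using the illustrative names from the steps above (mainLoop, doDataTransaction), shows the pattern being described:

```go
package leakdemo // illustrative

import (
	"context"

	"github.com/newrelic/go-agent/v3/newrelic"
)

func mainLoop(app *newrelic.Application) {
	// (1) A context created at a very high scope, outside any transaction.
	ctx := context.Background()

	for {
		// (2) The long-lived context is passed into every unit of work, and the
		// returned context (which now references a transaction) replaces it.
		ctx = doDataTransaction(ctx, app)
	}
}

func doDataTransaction(ctx context.Context, app *newrelic.Application) context.Context {
	txn := app.StartTransaction("datastore-work")
	defer txn.End() // (4) End() does not free the transaction's data.

	// (3) NewContext wraps context.WithValue, so the derived context holds a
	// pointer to the transaction. Returning it to mainLoop keeps every
	// transaction reachable for as long as ctx in mainLoop is in scope.
	ctx = newrelic.NewContext(ctx, txn)

	// ... business logic on the datastore, using ctx ...
	return ctx
}
```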
Summary
The assumption being made is that the transaction is out of scope once it is ended in doDataTransaction, and that it will be garbage collected. This is not the case, though. Remember that in step 3 we added a pointer to the transaction data into our context, which is part of the mainLoop() of our application. As long as the context object in the main loop is in scope, the Go garbage collector sees that a variable still references the data for a given transaction and cannot free its memory on the heap.
Solution Space
Canonically, a context should be used as a way to pass around “common scoped” data within an application. It’s important to be aware of the fact that data can never be removed from contexts, and that data stored inside a context is in scope as long as that context object is also in scope. For these reasons, New Relic recommends against storing transaction data on context objects that exist outside the scope of a transaction.
If using contexts is important to the flow of your application, we recommend that you derive a new context object for each transaction. This propagates the original context through the chain of function calls safely, without leaking transaction data. Code that is outside the scope of the current transaction does not need to access that transaction's data, so writing it to a context object that outlives the transaction does not make sense.
If your app does not access the transaction data from a context, then it really doesn’t belong there in the first place. In this case you can just pass the pointer down through function arguments.
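As a sketch of the first recommendation (names such as doDataTransaction and doBusinessLogic are illustrative), deriving a per-transaction context instead of overwriting the long-lived one keeps the transaction collectable once the function returns:

```go
package fixdemo // illustrative

import (
	"context"

	"github.com/newrelic/go-agent/v3/newrelic"
)

func doDataTransaction(ctx context.Context, app *newrelic.Application) {
	txn := app.StartTransaction("datastore-work")
	defer txn.End()

	// Derive a child context scoped to this transaction only. The caller's
	// long-lived context is never reassigned, so the transaction pointer
	// becomes unreachable (and garbage-collectable) after this function returns.
	txnCtx := newrelic.NewContext(ctx, txn)

	doBusinessLogic(txnCtx)
}

func doBusinessLogic(ctx context.Context) {
	// ... datastore calls that read the transaction via newrelic.FromContext(ctx) ...
	_ = ctx
}
```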
New Relic strongly recommends following the best practices outlined above. We understand that some users may assume that txn.End() removes all traces of transaction data, when in fact it only signals the end of the transaction. The behavior of txn.End() is designed to retain that data, and that behavior needs to be preserved; however, a possible future enhancement could be to add another method, such as txn.Close(), which could be called to remove the data explicitly. If this is something the community enthusiastically wants, the Go agent team can consider it, but there are trade-offs with this approach too. It seems like the Go authors thought of this exact scenario when designing the context library, and it's probably best to follow the guidelines and best practices they laid out.

We are using Go 1.18 with github.com/newrelic/go-agent/v3 v3.17.0 and are seeing the OOM error.
We created a utility function to create the New Relic segments within the transaction.
The segment is closed by the caller function, as below:
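The snippet itself is not included above; a hypothetical reconstruction of such a utility (StartSegment and queryUsers are illustrative names, not from the reporter's code) might look like:

```go
package segments // illustrative

import (
	"context"

	"github.com/newrelic/go-agent/v3/newrelic"
)

// StartSegment pulls the transaction out of the context and opens a segment.
// newrelic.FromContext returns nil if no transaction is stored; in the v3
// agent, calling these methods on a nil *Transaction is designed to be a no-op.
func StartSegment(ctx context.Context, name string) *newrelic.Segment {
	txn := newrelic.FromContext(ctx)
	return txn.StartSegment(name)
}

// The caller closes the segment, as described above.
func queryUsers(ctx context.Context) {
	seg := StartSegment(ctx, "queryUsers")
	defer seg.End()
	// ... datastore call ...
}
```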
Is there an update on this issue? We've managed to mitigate the memory footprint as described earlier. However, we're noticing CPU usage increase linearly over time on a service that processes periodic tasks. We tested that removing the following line of code resolves the issue:
Graph showing the before and after effect of the change: