go-agent: Memory leak in Go 1.17 (http.Transport)

Description

We started seeing a memory leak issue when we upgraded from Go 1.11.3 (no modules) to 1.17 (with go-modules). In the following graph memory utilization can be seen growing right after the upgrade on Dec 10. We had been running the agent (github.com/newrelic/go-agent/v3 v3.15.2) in Go 1.11.3 without any issues.

mem-leak

This is a high throughput service, the aggregated throughput of all our instances is: image

When we ran pprof on one of the instances, we saw something similar to the comment in go/issues/43966:

      flat  flat%   sum%        cum   cum%
    1.50MB  5.07%  5.07%    12.04MB 40.70%  net/http.(*Transport).dialConn
         0     0%  5.07%    12.04MB 40.70%  net/http.(*Transport).dialConnFor
    8.03MB 27.14% 32.21%     8.03MB 27.14%  bufio.NewWriterSize (inline)
    6.02MB 20.36% 52.57%     6.02MB 20.36%  bufio.NewReaderSize (inline)

The workaround seems to be to create a new http.Transport each time. Looking at the code the problem could be that collectorDefaultTransport is a Global variable and therefore reused. https://github.com/newrelic/go-agent/blob/master/v3/newrelic/transport.go#L18

After removing the New Relic agent from our application, we confirmed that the memory leak is gone.

Steps to Reproduce

We are not able to reproduce this in our development environment due to the low traffic there. But in theory, this could be reproduced in an application with a similar load running on Go 1.17 and NR agent version github.com/newrelic/go-agent/v3 v3.15.2

Expected Behavior

Memory utilization should remain constant

Your Environment

Running on Ubuntu:

Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

Additional context

There’s a PR that potentially addresses this issue in Go: https://github.com/golang/go/pull/50799

But other libraries we use are not using a Global variable for http.Transport and therefore do not trigger the memory leak. I don’t know if that is the standard way to work with http.Transport

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 5
  • Comments: 24 (15 by maintainers)

Most upvoted comments

We’ve managed to mitigate the issue by overriding the transaction pointer with an empty struct:

defer func() {
	txn.End()
	*txn = newrelic.Transaction{}
}()

Results from a high-volume service: Screen Shot 2022-02-15 at 8 56 11 PM

Thank you for bringing this to our attention. We’re looking into this and what the best way to mitigate this would be until the memory leak in http.Transport is fixed.

@iamemilio We had a support case as well created with newrelic. Please let me know if we can have working session. We can try reproducing this issue again.

@iamemilio I wanted to bring this to your attention, that we are using the latest version 1.18 and newrelic go agent v3.17.0

Memory Leak Root Cause Analysis for Datastore Segments

Hi all, Thanks so much for working with us on this issue, and being patient. We worked along side one of the original reporters of the issue, and were able to completely remediate it in their environment. We wanted to post a formal writeup here to explain what the problem was, how we fixed it, and what the next troubleshooting steps for all of you should be.

Root Cause

Here is a very simplified pseudo-code example of what caused the issue.

func mainLoop(app *newrelic.Application, db dbConfig, job chan Job) {
    ctx :=  context.TODO()                        (1)
 
    for {
        select {
        case <- ctx.Done():
           break
        case <- job:
           doDataTransaction(ctx, app, job)       (2)
        }
    }
}
 
func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job)
   txn := app.StartTransaction("my data transaction")
   ctx = newrelic.NewContext(ctx, txn)             (3)
 
   for _, dataOperation := range job.Batch() {
       doBusinessLogicSegment(ctx, db, dataOperation)
   }
 
   txn.End()                                       (4)
}

(1) A context is created to keep track of a flow of execution. This context is created at a very high code scope and exists outside the scope where transactions get created.

(2) This context gets passed into each function that executes a transaction with some business logic on a datastore.

(3) The doDataTransaction function creates a transaction, then calls NewContext to write the transaction pointer to a context. This function is intentionally named NewContext because it calls context.WithValue() which creates a copy of the original context with an added key value pair containing a pointer to the transaction object. By replacing the original context with this newly created context containing the transaction pointer, we are effectively passing the transaction up the stack to the mainLoop().

(4) The business logic is executed and the transaction is ended. This does not explicitly delete the transaction data on heap by design, in case the metadata needs to be accessed after the transaction ends.

Summary

The assumption being made is that the transaction is out of scope once it is ended in doDataTransaction, and it will be garbage collected. This is not the case though. Remember that in step 3, we added a pointer to the transaction data into our context which is part of the mainLoop() of our application. As long as the context object in the main loop is in scope, the go garbage collector sees that a variable is still in scope that references the data for a given transaction and is unable to delete its memory in heap.

Solution Space

Canonically, a context should be used as a way to pass around “common scoped” data within an application. It’s important to be aware of the fact that data can never be removed from contexts, and that data stored inside a context is in scope as long as that context object is also in scope. For these reasons, New Relic recommends against storing transaction data on context objects that exist outside the scope of a transaction.

If using contexts is important to the flow of your application, we recommend that you derive a new context object for each transaction. This will propagate the original context through the chain of function calls safely, without leaking transaction data. The Code that is out of the scope of the current transaction does not need to access this transaction’s data, so writing it to a context object that is out of scope does not make sense.

func mainLoop(app *newrelic.Application, db dbConfig, job chan Job) {
    ctx :=  context.TODO()                        
 
    for {
        select {
        case <- ctx.Done():
           break
        case <- job:
           doDataTransaction(ctx, app, job)       
        }
    }
}
 
func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job)
   txn := app.StartTransaction("my data transaction")

   // derive a new context to pass common data to code handling segments of this transaction
   txnContext := newrelic.NewContext(ctx, txn)        
 
   for _, dataOperation := range job.Batch() {
       doBusinessLogicSegment(txnContext, db, dataOperation)
   }
 
   txn.End()                                       
}

If your app does not access the transaction data from a context, then it really doesn’t belong there in the first place. In this case you can just pass the pointer down through function arguments.

func mainLoop(app *newrelic.Application, db dbConfig, job chan Job) {
    ctx :=  context.TODO()                        
 
    for {
        select {
        case <- ctx.Done():
           break
        case <- job:
           doDataTransaction(ctx, app, job)       
        }
    }
}
 
func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job)
   txn := app.StartTransaction("my data transaction")          
 
   for _, dataOperation := range job.Batch() {
        // pass the transaction pointer directly to the functions handling segments of this transaction
        doBusinessLogicSegment(ctx, db, dataOperation, txn)
   }
 
   txn.End()                                       
}

New Relic strongly recommends following the best practices outlined above. We understand that some users may assume that txn.End() removes all traces of transaction data when in fact it only signals the end of the transaction. The behavior of txn.End() is designed to retain that data and that behavior needs to be preserved, however a possible future enhancement could be to add another method, such as txn.Close() which can be called to remove the data explicitly. If this is something the community enthusiastically wants, the go team can consider it but there are trade offs with this approach too. It seems like the go authors thought of this exact scenario when designing the context library, and it’s probably best to follow the guidelines and best practices they laid out:

Incoming requests to a server should create a Context, and outgoing calls to servers should accept a Context. The chain of function calls between them must propagate the Context, optionally replacing it with a derived Context created using WithCancel, WithDeadline, WithTimeout, or WithValue. When a Context is canceled, all Contexts derived from it are also canceled.

The WithCancel, WithDeadline, and WithTimeout functions take a Context (the parent) and return a derived Context (the child) and a CancelFunc. Calling the CancelFunc cancels the child and its children, removes the parent’s reference to the child, and stops any associated timers. Failing to call the CancelFunc leaks the child and its children until the parent is canceled or the timer fires. The go vet tool checks that CancelFuncs are used on all control-flow paths.

We are using go 1.18 with github.com/newrelic/go-agent/v3 v3.17.0 and seeing the OOM error.

image

We created a utility function to create the newrelic segments within transaction.

func logNewRelicSegment(ctx context.Context, query string, sqlStatement string, dbconfig dbConfig) *newrelic.DatastoreSegment {
	//logging should be non-disruptive. any error during this is absorbed
	defer func() {
		if r := recover(); r != nil {
			log.Debug(ctx, fmt.Sprintf("Panic absorbed: %v", r))
		}
	}()
	txn := newrelic.FromContext(ctx)
	operation := extractOperation(ctx, query)
	collection := extractTableName(ctx, operation, query)
	seg := &newrelic.DatastoreSegment{
		StartTime:          txn.StartSegmentNow(),
		Product:            newrelic.DatastorePostgres,
		Collection:         collection,
		Operation:          operation,
		ParameterizedQuery: sqlStatement,
		Host:               dbconfig.Host,
		PortPathOrID:       strconv.Itoa(dbconfig.Port),
		DatabaseName:       dbconfig.Name,
	}
	return seg
}

The segment is closed by caller function as below

//new relic segment here
	seg := logNewRelicSegment(ctx, query, sqlStatement, txn.Conn.DBConfig)
	defer seg.End()

Is there an update on this issue? We’ve managed to mitigate the memory footprint as describe earlier. However, we’re noticing CPU increase linearly over time on a service that processes periodic tasks. We tested that removing the following line of code resolves the issue:

ctx = newrelic.NewContext(ctx, txn)

Graph showing the before and after effect of the change: Screen Shot 2022-03-09 at 10 55 36 AM