apm-agent-nodejs: Memory leak in version > 3.20.0

Describe the bug

We upgraded all of our team's services (10 of them) to 3.24.0, and all of them started running out of memory.

All of them run in K8s. It would take between 30 minutes and 3 hours for a pod to run out of memory and restart.

We downgraded to 3.20.0 and the issue disappeared.

We were not seeing the issue in a staging environment, so I believe it only becomes noticeable under higher load.

To Reproduce

Steps to reproduce the behavior: run an HTTP service under high load and observe that memory slowly increases until the Node.js process eventually dies with a “JavaScript heap out of memory” error.
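
For anyone trying to reproduce this locally, here is a minimal load-driver sketch (my own illustration, not from the original report). It assumes a local service exposing a /ping endpoint, like the Express example shown later in this thread:

const http = require('http')

// Repeatedly GET the service's /ping endpoint; each completed request
// starts the next one, so N loops give roughly N concurrent requests.
function fireRequest () {
  http.get('http://localhost:3000/ping', (res) => {
    res.resume()                  // drain the response body
    res.on('end', fireRequest)    // then issue the next request
  }).on('error', () => setTimeout(fireRequest, 1000))
}

for (let i = 0; i < 50; i++) fireRequest()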

Expected behavior

Do not run out of memory, as with version 3.20.0.

Environment (please complete the following information)

  • OS: Linux
  • Node.js version: 12.22.1
  • APM Server version: 7.15.2
  • Agent version: 3.24.0

How are you starting the agent? (please tick one of the boxes)

  • Calling agent.start() directly (e.g. require('elastic-apm-node').start(...))
  • Requiring elastic-apm-node/start from within the source code
  • Starting node with -r elastic-apm-node/start

Additional context

  • Agent config options
  • package.json dependencies

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 21 (11 by maintainers)

Most upvoted comments

Hello, we deployed with asyncHooks: false and elastic-apm-node 3.25.0 this morning.

[Screenshot 2021-12-16 at 11 24 32: memory-usage graph]

The red line marks the rolling update of our services. Average memory usage has been very stable since the change. The two usual suspects (the red and Earls Green lines) are no longer rapidly increasing their memory usage.
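
For reference, the workaround amounts to something like this minimal sketch (all other start options, e.g. serviceName and serverUrl, are omitted here):

// Start the agent with async_hooks-based context tracking disabled.
// This may reduce the accuracy of async context propagation in exchange
// for lower memory use.
const apm = require('elastic-apm-node').start({
  asyncHooks: false
})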

Some notes and requests for help for those following this (I see a number of people ❤️'d @Tirke’s helpful comment showing that the asyncHooks: false workaround helps):

  • Timing: I was able to start spending real time on this yesterday, but today is my last work day before the end-of-year holidays, so there will be a delay in updates.
  • Ruling out versions v3.22.0 or v3.21.0 of the Node.js APM agent: I strongly suspect the issue is due to a change introduced in v3.24.0 of the APM agent. However, I’ve had one private report of someone seeing increased memory usage by the agent in v3.22.0. If anyone who is seeing this issue is able to try v3.22.0 and/or v3.21.0 of the agent and confirm whether or not that “fixes” the increased memory usage, that would really help.

Some details on my current understanding/guesses

Internal details that you can skip if not interested: Version 3.24.0 of the Node.js APM agent introduced a significant change in how the agent tracks async tasks. Part of the change involves the agent having a “run context” object for each async task – the things that are put on the Node.js event loop. Those run context objects have a reference to the transaction and spans that a given async task is part of (e.g. the run context for an HTTP request handler has a reference to the APM transaction that collects the trace data for that HTTP request/response).

When a transaction ends (e.g. the HTTP response has been sent), the transaction object stays in memory until all the associated async tasks complete. Two things can happen in an application here that will both result in increased memory usage:

  1. If some of those associated async tasks only complete long after the response has been sent, then the APM transaction object will stay in memory for that “long” time after the response. For example, doSomeTask in the following lives for 10s after the HTTP response has ended:
const apm = require('elastic-apm-node').start()
const express = require('express')
const app = express()
const port = 3000

// Some async task that takes 10s to complete in the background.
async function doSomeTask () {
  return new Promise((resolve) => { 
    setTimeout(resolve, 10000) 
  })
}

app.get('/ping', (req, res) => {
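  // Deliberately not awaited: this Promise out-lives the HTTP response.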
  doSomeTask()
  res.send('pong')
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})

If the application is under constant load, then this results in higher memory usage. The APM agent exacerbates a situation like this because the APM transaction objects can be large (for large requests) compared to a small Promise.

  2. If there is a leak of async tasks in the application (e.g. leaked Promises) then, with the APM agent, those can keep APM transactions alive in memory. By themselves, Promises are very small, and an application could leak many of them without noticing much increased memory usage. However, with APM transactions indirectly attached to them, the memory usage is made much worse. One way to see if your application has a growing number of Promises in memory is to include this snippet of code (the APM agent can be disabled while testing, to rule it out as the source):
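// Count live Promise async resources: record each on init, drop on destroy.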
const promises = new Set()
require('async_hooks').createHook({
  init: (asyncId, type) => {
    if (type === 'PROMISE') {
      promises.add(asyncId)
    }
  },
  destroy: (asyncId) => {
    promises.delete(asyncId)
  }
}).enable()
setInterval(() => {
  console.log(`Promises in memory: ${promises.size}`)
}, 5000)
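
A possible companion to that snippet (my own addition, not part of the original comment): log heap usage on the same interval, so a growing Promise count can be correlated with growing memory.

// Log heap usage alongside the promise count so the two can be
// correlated over time; process.memoryUsage() is a Node.js built-in.
setInterval(() => {
  const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024
  console.log(`Heap used: ${heapUsedMb.toFixed(1)} MB`)
}, 5000)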

All,

elastic-apm-node@3.27.0 was just released, which includes a fix (https://github.com/elastic/apm-agent-nodejs/pull/2528) for potential high-memory use in versions 3.24.0 - 3.26.0 of the agent. I hope this resolves the issues that all or most of you were seeing, but I cannot be sure.

If/when you are able to try v3.27.0 – and without the asyncHooks: false workaround – I would love to hear whether or not this fixes the high-memory usage you were seeing. If not, then if you are able, I’d like to work with you to try to understand what you are seeing.


The quick summary of my understanding of the issue, if you are interested, is:

  • Starting with v3.24.0, a side-effect of improvements in the agent’s tracking of asynchronous tasks is that an APM Transaction object is not freed until all of its async tasks end. Before this change, it was freed when the transaction ended.
  • This means a possibly longer lifetime for those APM Transaction objects, depending on the user application – whether the application code results in async tasks that out-live the transaction in which they were started.
  • Those longer-lived Transaction objects were holding references to HTTP request/response objects, which meant higher memory usage for an application under constant load.
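
If it helps with verification, a small diagnostic sketch (an assumption on my part, not something prescribed in this thread): write a V8 heap snapshot on demand and inspect it for retained Transaction objects.

// Hypothetical helper: write a heap snapshot when the process receives
// SIGUSR2. The resulting .heapsnapshot file can be opened in the Chrome
// DevTools Memory tab to look for retained Transaction objects.
const v8 = require('v8')

process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot() // returns the generated file name
  console.log(`Heap snapshot written to ${file}`)
})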

Hi @trentm, I’m working with @Tirke and we’ve just released an update of our stack with the latest APM agent patch (3.27.0), without the asyncHooks workaround, and so far so good.

[Screenshot 2022-01-20 at 21 47 48: memory-usage graph]

Thanks a lot for your support.

@trentm we have the same issue with this version on particular services; most of the services are doing well. I will provide a report a bit later.