apm-agent-nodejs: Unexpected APM Server response when polling config

Describe the bug

We have a problem with our ES cluster, resulting in this error:

Error: Unexpected APM Server response when polling config
     at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)

which is obviously our problem. But the error is unhandled and takes down the process. Is that the expected behaviour? I can live with missing APM data, but having my service restart constantly is less than ideal. I don’t want my uptime constrained to that of the APM server.

To Reproduce

Ruin your ES

Expected behavior

Is there any way to handle this kind of error, and e.g. just log it? I found this, but I don’t want to trap all exceptions from the process (we already have a handler for that, to log & exit). Can I do something like:

apm.on("error", err => { log(err); })

Am I missing something obvious? Thanks.

Environment (please complete the following information)

  • OS: Debian 9
  • Node.js version: 12.16.3 (LTS)
  • APM Server version: 7.6.1
  • Agent version: 3.5

How are you starting the agent? (please tick one of the boxes)

  • Calling agent.start() directly (e.g. require('elastic-apm-node').start(...))
  • Requiring elastic-apm-node/start from within the source code
  • Starting node with -r elastic-apm-node/start

Additional context

Using env vars for config

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 15
  • Comments: 32 (11 by maintainers)

Most upvoted comments

By fluke, I found a local repro for this.

tl;dr: it only happens with node v12 (for me, at least), and only on a second failed attempt to retrieve central config from APM server.

Working from this part of the traceback above:

 Emitted 'error' event on TLSSocket instance at:
     at emitErrorNT (internal/streams/destroy.js:92:8)
     at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
     at processTicksAndRejections (internal/process/task_queues.js:84:21) {

I was able to match those line numbers to node v12’s version of internal/streams/destroy.js. Using this patch to node_modules/elastic-apm-http-client/index.js (importantly, the part that reduces the time until the next poll) and running a hello-world app using APM with node v12:

% node --version
v12.19.1

itself configured to talk to a mock APM-server that responds with 503 to central config requests, resulted in the crash.

I haven’t dug in any further yet.

I finally got back to this… and couldn’t repro for a while because I hadn’t been specific about my “hello world app” before.

better details on a repro

Here are better details on a repro.

  1. Using the following demo app:
var apm = require('elastic-apm-node').start({
  serviceName: 'issue1749-two-config-crash-app',
  metricsInterval: 0,
  // captureExceptions:false so we can see the error on the console, rather
  // than just in the error object sent to APM server.
  captureExceptions: false,
  logLevel: 'debug',
  // disableInstrumentations: ['express']
})

if (process.env.PLZ_REQUIRE_EXPRESS) {
  require('express')
}
const http = require('http')
const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'text/plain')
  res.end('hi')
})
const port = 3000
server.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})
  2. With the following patch to “node_modules/elastic-apm-http-client/index.js” (mainly to get the shorter _scheduleNextConfigPoll() value):
diff --git a/index.js b/index.js
index c27fb71..52ae377 100644
--- a/index.js
+++ b/index.js
@@ -163,12 +163,16 @@ Client.prototype._pollConfig = function () {
   }

   const req = this._transport.get(opts, res => {
+    console.warn('XXX res.statusCode', res.statusCode)
     res.on('error', err => {
       // Not sure this event can ever be emitted, but just in case
+      console.warn('XXX in res.on(error), calling res.destroy(err)', err)
       res.destroy(err)
     })

-    this._scheduleNextConfigPoll(getMaxAge(res))
+    // XXX
+    // this._scheduleNextConfigPoll(getMaxAge(res))
+    this._scheduleNextConfigPoll(3)

     if (
       res.statusCode === 304 || // No new config since last time
@@ -180,26 +184,36 @@ Client.prototype._pollConfig = function () {
     }

     streamToBuffer(res, (err, buf) => {
-      if (err) return res.destroy(err)
+      if (err) {
+        console.warn('XXX streamToBuffer error, calling res.destroy', err)
+        return res.destroy(err)
+      }

       if (res.statusCode === 200) {
         // 200: New config available (or no config for the given service.name / service.environment)
         const etag = res.headers.etag
         if (etag) this._conf.lastConfigEtag = etag

+        // XXX Should not emit in the try/catch. If listener throws, then we catch it here.
         try {
           this.emit('config', JSON.parse(buf))
         } catch (e) {
+          console.warn('XXX config parse error, calling res.destroy', e)
           res.destroy(e)
         }
       } else {
-        res.destroy(processConfigErrorResponse(res, buf))
+        // XXX
+        console.warn('XXX config res.statusCode !== 200, calling res.destroy', res.statusCode, )
+        var cfgErr = processConfigErrorResponse(res, buf)
+        res.destroy(cfgErr)
+        // res.destroy(processConfigErrorResponse(res, buf))
       }
     })
   })

   req.on('error', err => {
-    this._scheduleNextConfigPoll()
+    console.warn('XXX req.on error', err)
+    this._scheduleNextConfigPoll(3)
     this.emit('request-error', err)
   })
 }
@@ -213,6 +227,7 @@ Client.prototype._scheduleNextConfigPoll = function (seconds) {
     this._configTimer = null
     this._pollConfig()
   }, seconds * 1000)
+  console.warn('XXX _scheduleNextConfigPoll in %ss', seconds)

   this._configTimer.unref()
 }
  3. And letting it run long enough (~15-20s) for the agent to make a second centralConfig poll

Running node foo.js -> no crash
Running PLZ_REQUIRE_EXPRESS=1 node foo.js -> crash

I.e.: just require’ing “express” makes a difference here.

More specific version details for the record:

% npm ls express
play@1.0.1 /Users/trentm/tm/play
└── express@4.17.1

% npm ls elastic-apm-node
play@1.0.1 /Users/trentm/tm/play
└── elastic-apm-node@3.10.0  -> /Users/trentm/el/apm-agent-nodejs5


[11:20:35 trentm@pink:~/el/apm-agent-nodejs5 (git:tags/v3.10.0)]
% git log1 -1
* 68f27d3 - (HEAD, tag: v3.10.0) 3.10.0 (#1936) (6 weeks ago) <Trent Mick>

% git diff --staged
diff --git a/package.json b/package.json
index 6aef98b..da46f93 100644
--- a/package.json
+++ b/package.json
@@ -87,7 +87,7 @@
     "console-log-level": "^1.4.1",
     "cookie": "^0.4.0",
     "core-util-is": "^1.0.2",
-    "elastic-apm-http-client": "^9.4.2",
+    "elastic-apm-http-client": "file:../apm-nodejs-http-client3",
     "end-of-stream": "^1.4.4",
     "error-stack-parser": "^2.0.6",
     "escape-string-regexp": "^4.0.0",

[11:21:34 trentm@pink:~/el/apm-nodejs-http-client3 (git:tags/v9.4.2)]
% git log1 -1
* 5a501f5 - (HEAD, tag: v9.4.2) 9.4.2 (3 months ago) <Trent Mick>
  • fails with express@5 as well:
    % npm ls express
    play@1.0.1 /Users/trentm/tm/play
    └── express@5.0.0-alpha.8
    
  • does not crash if the agent is configured with: disableInstrumentations: ['express']

the issue

Working backwards from the crash point.

  • The crash is from an unhandled “error” event on a socket instance:

    ...
    Emitted 'error' event on Socket instance at:
        at emitErrorNT (internal/streams/destroy.js:92:8)
    ...
    
  • That “error” event is being emitted due to the agent’s apm-server client calling res.destroy(err) after a non-200 response from polling the APM server for config here. IncomingMessage.destroy([error]) emits “error” on the socket.

    So why only on the second poll to APM Server?

  • The apm-server client creates an http.Agent with keepAlive=true by default. The intent is to use that keep-alive agent for requests to the APM server. However, due to a bug in #53, the agent (set as this._agent) isn’t set until after the request options for apm-server requests are set.

    This means that this._agent === undefined here when request options are set. In other words, the keep-alive agent isn’t getting used.

  • However, if client.config(...) is called (again) on the client after initial creation, then those request options will be created and this time this._agent will be set.

    There is one current case in the Node.js APM Agent where client.config() is called again: in agent.setFramework(), which is called automatically in these module instrumentations:

    modules/hapi.js:  agent.setFramework({ name: 'hapi', version, overwrite: false })
    modules/restify.js:  agent.setFramework({ name: 'restify', version, overwrite: false })
    modules/koa.js:  agent.setFramework({ name: 'koa', version, overwrite: false })
    modules/express.js:  agent.setFramework({ name: 'express', version, overwrite: false })
    modules/fastify.js:  agent.setFramework({ name: 'fastify', version, overwrite: false })
    

    The result is that if any of those 5 modules is required in a user app, then the apm-server client will switch to using its keep-alive agent. The apm-server client always immediately polls for central config, so this switch to the keep-alive agent will only occur for the 2nd and subsequent polls.

  • So why isn’t there an “error” event listener on that res.socket? And why only in node v12? Because of this “TODO” in node core’s http client handling of keep alive after the “end” of response: https://github.com/nodejs/node/blob/v12.19.1/lib/_http_client.js#L625-L626

    socket.removeListener('error', socketErrorListener);
    socket.removeListener('data', socketOnData);
    socket.removeListener('end', socketOnEnd);
    
    // TODO(ronag): Between here and emitFreeNT the socket
    // has no 'error' handler.
    

    In other words, if you res.destroy(err) after the res “end” event in node v12 with a keep-alive agent, you’ll be emitting an “error” event on a socket with no “error” listener. That results in a crash (or a call to any registered uncaughtException handler).

a smaller repro demo

var http = require('http')
var agent = new http.Agent({keepAlive: true})
var opts = {
  agent: agent,
  hostname: 'www.google.com',
  method: 'GET',
  path: '/'
}

function makeReq() {
  var req = http.get(opts, function (res) {
      console.warn('res.socket "error" event listeners before end:', res.socket.listeners('error'))
      res.on('data', function (chunk) { })
      res.on('end', function () {
        console.warn('res.on end')
        console.warn('res.socket "error" event listeners after end:', res.socket.listeners('error'))
        res.destroy(new Error('call the whole thing off'))
      })
      res.on('error', function (err) {
        console.warn('res.on error', err)
      })
  })
  req.on('error', function (err) {
    console.warn('req.on error', err)
  })
}

makeReq()

setTimeout(function () {}, 10000000) // don't exit right away

Run that “bar.js” with node v12:

% node bar.js
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
res.on end
res.socket "error" event listeners after end: []
events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: call the whole thing off
    at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

Run it with NODE_DEBUG=http to see more of what is going on in the node core http client.

% NODE_DEBUG=http node bar.js
HTTP 64387: call onSocket 0 0
HTTP 64387: createConnection www.google.com:80: {
  agent: Agent {
    _events: [Object: null prototype] {
      free: [Function],
      newListener: [Function: maybeEnableKeylog]
    },
    _eventsCount: 2,
    _maxListeners: undefined,
    defaultPort: 80,
    protocol: 'http:',
    options: { keepAlive: true, path: null },
    requests: {},
    sockets: { 'www.google.com:80:': [] },
    freeSockets: {},
    keepAliveMsecs: 1000,
    keepAlive: true,
    maxSockets: Infinity,
    maxFreeSockets: 256,
    maxTotalSockets: Infinity,
    totalSocketCount: 0,
    [Symbol(kCapture)]: false
  },
  hostname: 'www.google.com',
  method: 'GET',
  path: null,
  port: 80,
  host: 'www.google.com',
  keepAlive: true,
  servername: 'www.google.com',
  _agentKey: 'www.google.com:80:'
}
HTTP 64387: sockets www.google.com:80: 1 1
HTTP 64387: outgoing message end.
(node:64387) Warning: Setting the NODE_DEBUG environment variable to 'http' can expose sensitive data (such as passwords, tokens and authentication headers) in the resulting log.
HTTP 64387: AGENT incoming response!
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
HTTP 64387: AGENT socket keep-alive
res.on end
res.socket "error" event listeners after end: []
HTTP 64387: CLIENT socket onFree
HTTP 64387: agent.on(free) www.google.com:80:
events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: call the whole thing off
    at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

I’ll start a fix on Monday.

P.S. Is there any workaround that I can apply in the meantime?

Our workaround was to dump Elastic for APM. The lack of response around this issue was not confidence inspiring so we decided to evaluate alternative solutions instead. Not a great answer, but we just couldn’t have an agent crash our container instances every time we applied an update to our Elastic cluster.

Fixed now with elastic-apm-node@3.12.1. Thanks all, esp. @grahamrhay, for the reporting and persistence.

Reopening. The GH PR linking automatically closed this, but this won’t actually be fixed until there is an elastic-apm-http-client v9.5.1 release and a release of this agent to require that version.

@itsTeknas @pklime2 there is a very simple workaround mentioned in the comments above. Set centralConfig: false in the APM agent options.
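For anyone landing here, that workaround looks like this (a sketch; the serviceName value is illustrative, and per the agent docs the option can also be set via the ELASTIC_APM_CENTRAL_CONFIG environment variable):

```javascript
// Disable central-config polling so the agent never makes the
// config request that can trigger this crash.
require('elastic-apm-node').start({
  serviceName: 'my-service', // illustrative
  centralConfig: false       // do not poll APM Server for central config
})
```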

Error: Unexpected APM Server response when polling config
    at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)
    at .../node_modules/elastic-apm-http-client/index.js:196:21
    at IncomingMessage.<anonymous> (.../node_modules/fast-stream-to-buffer/index.js:20:9)
    at IncomingMessage.f (.../node_modules/once/once.js:25:25)
    at IncomingMessage.onend (.../node_modules/end-of-stream/index.js:36:27)
    at IncomingMessage.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on TLSSocket instance at: 
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 503,
  response: '{"statusCode":429,"error":"Too Many Requests","message":"[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [4065427520/3.7gb], which is larger than the limit of [4063657984/3.7gb], real usage: [4065427520/3.7gb], new bytes reserved: [0/0b], usages [request=16440/16kb, fielddata=341/341b, in_flight_requests=509722/497.7kb, accounting=1359924147/1.2gb], with { bytes_wanted=4065427520 & bytes_limit=4063657984 & durability=\\"PERMANENT\\" }"}'

is that any use?