newrelic-ruby-agent: Hangs with rake task

We have hundreds of rake tasks that are hanging. I've run gdb and found that the NewRelic thread is the only one still running; the last call is wait_to_run. I'm thinking that the rake task has called exit(0) and for some reason the NewRelic thread isn't exiting. The agent version is 5.3.0.346.

["eval:1:in `open'", "eval:1:in `wait_to_run'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:118:in `select'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:118:in `wait_to_run'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:104:in `run_once'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:99:in `run'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:619:in `create_and_run_event_loop'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:679:in `block (2 levels) in deferred_work!'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent.rb:495:in `disable_all_tracing'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:676:in `block in deferred_work!'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:656:in `catch_errors'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:675:in `deferred_work!'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:702:in `block in start_worker_thread'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/threading/agent_thread.rb:14:in `block in create'"]

Also, I’m seeing this over and over again in the logs for a process that is hanging:

[2020-05-18 13:24:11 -0400 qa-web (11785)] INFO : No known dispatcher detected.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

A deep dive into this issue reveals that e9539f7 does indeed solve the issues described herein. What's happening is that an error occurs when attempting to connect to the collector (variously caused by an SSL error, a zlib error, or network latency), and the agent then enters an exponential-backoff loop of reconnect attempts.

If we’re in this reconnect loop and attempt to gracefully shutdown the agent, then we end up waiting indefinitely for a successful connect to the collector server.

The connection happens in a @worker_thread that is launched in the deferred_work! method. By joining that thread for up to 3 seconds during shutdown, we at least get a chance to make a successful connection and send any pending data. Failing that within the timeout threshold, we simply exit the thread and continue with the shutdown of the agent.
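
As a minimal sketch of that idea (illustrative only; the worker body and the 3-second constant are stand-ins, not the agent's actual code):

  # Minimal sketch of the idea (illustrative only, not the agent's code):
  # wait a bounded amount of time for the worker at shutdown instead of
  # waiting on it forever.
  worker = Thread.new do
    sleep # stand-in for a connect/harvest loop that may block indefinitely
  end

  at_exit do
    # Thread#join with a timeout returns nil if the thread is still running;
    # it stops the wait but does not terminate the thread by itself.
    worker.kill if worker.join(3).nil?
  end

Note that join(3) by itself only stops the waiting; whether and how the worker is then terminated is exactly what the later comments in this thread examine.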

I have annotated the code with more extensive comments to explain this, and those comments will appear in the next release of the agent. Meanwhile, you should be able to resolve your current issue by upgrading to the current agent @ 6.11.0.

Since this has been verified and resolved, I am leaving this ticket closed. If you're still having issues after upgrading, please route your inquiry through the support desk, where we can collect more details about your issue and diagnose further.

Reopening for tracking purposes as requested, and we will update when a fix for this issue is released in a new gem version.

I will upgrade today. This will probably never happen again.

What probably happened is that:

  • the bug (mishandling) in the agent always existed
  • something in the server changed recently that caused more of these zlib errors
  • there was also an SSL certificate that expired a couple of weeks back and caused hell everywhere
  • network latency differs for each one

So some or all of the above contributed to the bug surfacing now (or at least surfacing more noticeably).

That would explain why more people haven't reported this (at least publicly) and why this wasn't an issue a couple of months back.

I just experienced several more process hangs in production, this time correlated with New Relic unavailability on Jul 29.

@mwlang please reopen this issue until a fix (such as the one proposed in #359) is merged into a public release so that I (and others) can track when it’s fully resolved in a new gem version.

I debugged this issue a few months ago and have some more details to add to @mwlang’s deep dive:

  • The issue is only partially fixed by e9539f7f7bd0da54ea6e108725d54a61629c23a2: that commit works around most of the race-condition issues with its 3-second timeout (most attempts to connect to the server take less than that), but the same hang still occurs whenever the agent takes longer than 3 seconds to connect to the server.

  • I just experienced this issue again in production on version 6.10.0.364, triggered by the CloudFlare outage on July 17, which caused the collector.newrelic.com endpoint to experience intermittent latency.

Here is a (truncated) INFO log from a hung process:

[2020-07-17 21:42:08 (6113)] INFO : Starting the New Relic agent version 6.10.0.364 in "production" environment.
[2020-07-17 21:42:08 (6113)] INFO : Environment: production
[2020-07-17 21:42:08 (6113)] INFO : Finished instrumentation
[2020-07-17 21:42:08 (6113)] INFO : Starting Agent shutdown
[2020-07-17 21:42:11 (6113)] ERROR : Error establishing connection with New Relic Service at collector.newrelic.com:443:
[2020-07-17 21:42:11 (6113)] ERROR : Zlib::BufError: buffer error
[2020-07-17 21:42:11 (6113)] INFO : Will re-attempt in 15 seconds
[2020-07-17 21:42:39 (6113)] INFO : Reporting to: https://rpm.newrelic.com/accounts/[...]

Note the ERROR is logged exactly 3 seconds after ‘Starting Agent shutdown’.

  • The reason the fix does not work entirely as advertised is that this part of the algorithm does not behave as described:

Failing that within the timeout threshold, we simply exit the thread and continue with the shutdown of the agent.

While debugging this issue, I found a bug in Ruby's Net::HTTP where exiting the main thread while a child thread is in the middle of a gzip-encoded HTTP request (the situation that throws the Zlib::BufError seen in the logs) causes the child thread to never exit as expected, hanging the shutdown process. I reported this bug a couple of months ago, and it will be fixed in the next version of Ruby (2.8?).

A more thorough fix would involve either patching Net::HTTP to narrowly fix this bug in earlier versions of Ruby, or adding raise if Thread.current.status == "aborting" to the top of the rescue clause in Agent#connect (so a Zlib::BufError would cause the thread to exit properly anyway): https://github.com/newrelic/newrelic-ruby-agent/blob/bca99037c6a66f4a0068d710c6b23fa6b2281b35/lib/new_relic/agent/agent.rb#L955-L962
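
For illustration only, here is a rough sketch of where such a guard would sit; the failing connect, the rescued error list, and the backoff are placeholders rather than the agent's actual code:

  require "zlib"

  # Illustrative sketch of the proposed guard: in the worker's reconnect
  # loop, re-raise instead of retrying when the thread is being aborted
  # (i.e. the process is exiting), so the worker actually terminates.
  worker = Thread.new do
    begin
      raise Zlib::BufError, "buffer error" # stand-in for a failed connect attempt
    rescue Zlib::BufError, Zlib::DataError
      raise if Thread.current.status == "aborting" # the proposed guard
      sleep 15 # stand-in for the agent's exponential backoff
      retry
    end
  end

Thread#status returns "aborting" while a thread is being killed, which (per the description above) is the state the worker would be in when the main thread exits mid-request, so the guard re-raises instead of sleeping and retrying.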

I think this issue should be reopened until it’s fully resolved along these lines.

I second this issue.

Here are some of my findings so far.

The stale tasks are connected to zlib errors

[2020-06-01 17:20:22 +0300 as120 (11631)] ERROR : Error establishing connection with New Relic Service at collector.newrelic.com:443:
[2020-06-01 17:20:22 +0300 as120 (11631)] ERROR : Zlib::BufError: buffer error
[2020-06-03 15:32:16 +0300 as120 (15876)] ERROR : Error establishing connection with New Relic Service at collector.newrelic.com:443:
[2020-06-06 15:32:16 +0300 as120 (15876)] ERROR : Zlib::DataError: invalid distance too far back

The agent’s worker is stuck on select

from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/threading/agent_thread.rb:14:in `block in create'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/agent.rb:678:in `block in start_worker_thread'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/agent.rb:651:in `deferred_work!'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/agent.rb:632:in `catch_errors'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/agent.rb:652:in `block in deferred_work!'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent.rb:505:in `disable_all_tracing'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/agent.rb:655:in `block (2 levels) in deferred_work!'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/agent.rb:595:in `create_and_run_event_loop'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/event_loop.rb:99:in `run'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/event_loop.rb:104:in `run_once'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/event_loop.rb:118:in `wait_to_run'
  from /var/sites/skroutz_cap/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-6.4.0.356/lib/new_relic/agent/event_loop.rb:118:in `select'

The agent's worker is a thread, created by the agent, which sends data to the service; it is stuck in select waiting for data.

Why doesn't the parent thread close the worker thread after the task has finished?
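
As the deep dive earlier in this thread explains, the parent does wait for the worker at shutdown; the problem is that the wait is effectively unbounded while the worker never returns from select. A conceptual sketch of that hang, with illustrative names only:

  # Conceptual sketch of the hang (illustrative only, not the agent's code):
  # the task finishes, the exit hook runs, and the parent waits forever on
  # a worker that never returns, like the worker stuck in select above.
  worker = Thread.new { sleep } # stand-in for the worker blocked in select/connect

  at_exit { worker.join } # unbounded join: the process never exits

Bounding that join (as the 6.9 change does) or making the worker bail out of its retry loop is what eventually lets the process exit.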

I can verify this happens in:

Rails 5.1 with newrelic_rpm 6.4 and Rails 4.2 with newrelic_rpm 5.4

There is a possibly related fix in newrelic_rpm 6.9, but I don't know whether it addresses this: https://github.com/newrelic/rpm/commit/e9539f7f7bd0da54ea6e108725d54a61629c23a2#

And most importantly, why does this not affect MORE people??