newrelic-ruby-agent: Hangs with rake task
We have hundreds of rake tasks that are hanging. I’ve run gdb and found there is only the NewRelic thread that is running. The last call is wait_to_run. I’m thinking that the rake task has called exit(0) and for some reason the NewRelic thread isn’t exiting. The version is 5.3.0.346.
["eval:1:in `open'", "eval:1:in `wait_to_run'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:118:in `select'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:118:in `wait_to_run'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:104:in `run_once'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/event_loop.rb:99:in `run'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:619:in `create_and_run_event_loop'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:679:in `block (2 levels) in deferred_work!'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent.rb:495:in `disable_all_tracing'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:676:in `block in deferred_work!'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:656:in `catch_errors'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:675:in `deferred_work!'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/agent.rb:702:in `block in start_worker_thread'", "/var/www/shared/bundle/ruby/2.3.0/gems/newrelic_rpm-5.3.0.346/lib/new_relic/agent/threading/agent_thread.rb:14:in `block in create'"]
Also, I’m seeing this over and over again in the logs for a process that is hanging:
[2020-05-18 13:24:11 -0400 qa-web (11785)] INFO : No known dispatcher detected.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (5 by maintainers)
Commits related to this issue
- Prevent connecting agent thread from hanging on shutdown A bug in `Net::HTTP`'s Gzip decoder can cause the (un-catchable) thread-kill exception to be replaced with a (catchable) `Zlib` exception, whic... — committed to wjordan/newrelic-ruby-agent by wjordan 4 years ago
- Prevent connecting agent thread from hanging on shutdown A bug in `Net::HTTP`'s Gzip decoder can cause the (un-catchable) thread-kill exception to be replaced with a (catchable) `Zlib` exception, whic... — committed to newrelic/newrelic-ruby-agent by wjordan 4 years ago
A deep dive into this issue reveals that e9539f7 does indeed solve the issues described herein. What’s happening is that an error is occurring when attempting to connect to the collector (variously caused by SSL error, zLib error, or network latency) and the agent then enters into an exponential backoff strategy loop for reconnecting.
If we’re in this reconnect loop and attempt to gracefully shutdown the agent, then we end up waiting indefinitely for a successful connect to the collector server.
The connection happens in a @worker_thread that is launched in the deferred_work! method. By attempting to join for 3 seconds during the shutdown, we’re able to at least attempt to make a successful connection and send any pending data. Failing that within the timeout threshold, we simply exit the thread and continue with the shutdown of the agent.
I have annotated the code with more extensive comments to explain and the comments will appear in the next release of the agent. Meanwhile, you should be able to resolve your current issue by upgrading to the current Agent @ 6.11.0.
Since this is verified and resolved incidentally, I am leaving this ticket closed. If you’re still having issues after upgrading, please route your inquiry through support desk where we can collect more details about your issue to diagnose further.
Reopening for tracking purposes as requested, and we will update when a fix for this issue is released in a new gem version.
I will upgrade today. Probably this is will never happen again.
What happened probably is that:
So some of the above or all of the above attributes to the bug surfacing now (or at least more noticeably).
That would explain why not more people have reported this (at least publicly) and why this was not an issue a couple of months back.
I just experienced several more process hangs in production, this time correlated with New Relic unavailability on Jul 29.
@mwlang please reopen this issue until a fix (such as the one proposed in #359) is merged into a public release so that I (and others) can track when it’s fully resolved in a new gem version.
I debugged this issue a few months ago and have some more details to add to @mwlang’s deep dive:
The issue is only partially fixed by e9539f7f7bd0da54ea6e108725d54a61629c23a2- that commit ends up working around most of the race-condition issues with its 3-second timeout (most attempts to connect to the server are shorter than this), but it still triggers the same hang situation whenever the agent takes longer than 3 seconds to connect to the server.
I just experienced this issue again in production in version 6.10.0.364, triggered by the CloudFlare outage on July 17 which caused the
collector.newrelic.comendpoint to experience intermittent latency.Here is a (truncated)
INFOlog from a hung process:Note the
ERRORis logged exactly 3 seconds after ‘Starting Agent shutdown’.While debugging this issue, I found a bug in Ruby’s
Net::HTTPwhere exiting the main thread while a child thread is in the middle of a gzip-encoded HTTP request (a situation that throws theZlib::BufErrorseen in the logs) causes the child thread to never exit as expected, hanging the shutdown process. I reported this bug a couple months ago and it will be fixed in the next version of ruby (2.8?).A more thorough fix would involve either patching
Net::HTTPto narrowly fix this bug in previous versions of Ruby, or addraise if Thread.current.status == "aborting"to the top of the rescue clause inAgent#connect(soZlib::BufErrorwould cause the thread to properly exit anyway): https://github.com/newrelic/newrelic-ruby-agent/blob/bca99037c6a66f4a0068d710c6b23fa6b2281b35/lib/new_relic/agent/agent.rb#L955-L962I think this issue should be reopened until it’s fully resolved along these lines.
I second this issue.
Here are some of my findings so far.
The stale tasks are connected to zlib errors
The agent’s worker is stuck on
selectThe agent’s worker is a thread created by the agent which sends data to the service and is stuck in
selectwaiting for data.Why doesn’t the parent thread closes the worker thread after the task has finished?
I can verify this happens in:
Rails 5.1 with newrelic_rpm 6.4 and Rails 4.2 with newrelic_rpm 5.4
There is a kind-of related fix in newrelic_rpm 6.9 but I don’t know if this is related. https://github.com/newrelic/rpm/commit/e9539f7f7bd0da54ea6e108725d54a61629c23a2#
And most importantly, why does this not affect MORE people??