newrelic-ruby-agent: Thread tracing causes process to hang
Description
We recently attempted to upgrade the version of the agent we use from 8.16 to 9.0 and ran into an issue where, after our code had completed a Sidekiq job, the process would be spinning at 100% CPU usage seemingly indefinitely.
We were able to prevent this issue by disabling the thread tracing (setting instrumentation.thread.tracing to false), so it seems like this issue may have been present for a while but only manifested with the automatic enabling of thread tracing that was included in 9.0.
Notably, this only seems to occur for a particular job that we have. We use Sidekiq for job processing, and this issue would manifest for a particularly long-running Sidekiq job that made many database requests. Reducing the number of DB requests made by that job prevented this issue from manifesting.
Expected Behavior
We expect the New Relic agent to publish the traces quickly and without using 100% of a CPU.
Steps to Reproduce
I wish I had more specific steps for you here — the job we’re looking at takes over ten minutes to run so it’s a long debug cycle. We were not able to reproduce the issue when running the job outside of Sidekiq, so it seems like maybe some interaction of the threading used by Sidekiq with the thread tracing done in NewRelic.
If this seems worth digging into then I can look further into creating a reproducible example, let me know!
Your Environment
Running on Heroku 18 (Ubuntu 18.04). New Relic Ruby Agent: 9.0.0 Ruby: 3.1.2 Sidekiq: 6.5.6
For Maintainers Only or Hero Triaging this bug
Suggested Priority (P1,P2,P3,P4,P5): Suggested T-Shirt size (S, M, L, XL, Unknown):
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
2 full days of running the
must_go_fasterbranch and we haven’t had a single stuck process. Things seem to be working much better.Hi @kaylareopelle, fantastic work. I don’t mind bumping to do some testing for you guys. I’ll deploy later today and report back what I find out.
Hi @travisbell, @parkerfinch! The changes provided by https://github.com/newrelic/newrelic-ruby-agent/pull/1927 have been released in v9.2.0 of the
newrelic_rpmgem.Thanks, @parkerfinch! If the agent itself is hogging all of the CPU but still managing to operate, its activity can be seen by temporarily having it operate in debug level logging mode, and/or by enabling the optional audit log. These are the two relevant
newrelic.ymlconfig options:Those options aren’t good for production use, but helpful for temporary debugging.
With debug level logging,
log/newrelic_agent.logshould continue to see constant activity while the agent is still running and with the audit log enabled,log/newrelic_audit.logwill see new content every time some data is sent up to New Relic’s servers.On our end, we now have a repro that will spike the CPU and hang the monitored application for a period of time only when thread instrumentation is enabled. The application eventually recovers after the agent has finished going through a backlog of thread related work, so it’s not a complete and permanent hang we’re seeing.
We are now evaluating a variety of ways to improve things while keeping thread instrumentation enabled, and we will share our findings and improvement PRs here as they become available.
If anyone else has a repro that they can share that does not include any sensitive data, we’d also be happy to test the fix against any* repro shared with us. Otherwise, we’ll have a feature branch prepped that others can pull down and test with.
Thanks @fallwith.
Alright, 3 nights of running Resque without thread tracing and we didn’t have a single hanging process. That’s the first time in ~3 weeks I can say that so I feel pretty confident this was our culprit. Hard to imagine it’s a coincidence at this point but I will reply back in case I see it happen again.
To add some extra details about our ENV:
And to re-iterate, it seems to affect long running jobs and not short quick ones. Long running for us means ~2-3 hours.