newrelic-ruby-agent: Thread tracing causes process to hang

Description

We recently attempted to upgrade the version of the agent we use from 8.16 to 9.0 and ran into an issue where, after our code had completed a Sidekiq job, the process would be spinning at 100% CPU usage seemingly indefinitely.

We were able to prevent this issue by disabling the thread tracing (setting instrumentation.thread.tracing to false), so it seems like this issue may have been present for a while but only manifested with the automatic enabling of thread tracing that was included in 9.0.

Notably, this only seems to occur for a particular job that we have. We use Sidekiq for job processing, and this issue would manifest for a particularly long-running Sidekiq job that made many database requests. Reducing the number of DB requests made by that job prevented this issue from manifesting.

Expected Behavior

We expect the New Relic agent to publish the traces quickly and without using 100% of a CPU.

Steps to Reproduce

I wish I had more specific steps for you here — the job we’re looking at takes over ten minutes to run so it’s a long debug cycle. We were not able to reproduce the issue when running the job outside of Sidekiq, so it seems like maybe some interaction of the threading used by Sidekiq with the thread tracing done in NewRelic.

If this seems worth digging into then I can look further into creating a reproducible example, let me know!

Your Environment

Running on Heroku 18 (Ubuntu 18.04). New Relic Ruby Agent: 9.0.0 Ruby: 3.1.2 Sidekiq: 6.5.6

For Maintainers Only or Hero Triaging this bug

Suggested Priority (P1,P2,P3,P4,P5): Suggested T-Shirt size (S, M, L, XL, Unknown):

About this issue

Original URL
State: closed
Created a year ago
Reactions: 1
Comments: 17 (7 by maintainers)

Most upvoted comments

2 full days of running the must_go_faster branch and we haven’t had a single stuck process. Things seem to be working much better.

travisbell on Apr 13, 2023

Hi @kaylareopelle, fantastic work. I don’t mind bumping to do some testing for you guys. I’ll deploy later today and report back what I find out.

travisbell on Apr 11, 2023

Hi @travisbell, @parkerfinch! The changes provided by https://github.com/newrelic/newrelic-ruby-agent/pull/1927 have been released in v9.2.0 of the newrelic_rpm gem.

kaylareopelle on Apr 17, 2023

Thanks, @parkerfinch! If the agent itself is hogging all of the CPU but still managing to operate, its activity can be seen by temporarily having it operate in debug level logging mode, and/or by enabling the optional audit log. These are the two relevant newrelic.yml config options:

  log_level: debug
  audit_log.enabled: true

Those options aren’t good for production use, but helpful for temporary debugging.

With debug level logging, log/newrelic_agent.log should continue to see constant activity while the agent is still running and with the audit log enabled, log/newrelic_audit.log will see new content every time some data is sent up to New Relic’s servers.

On our end, we now have a repro that will spike the CPU and hang the monitored application for a period of time only when thread instrumentation is enabled. The application eventually recovers after the agent has finished going through a backlog of thread related work, so it’s not a complete and permanent hang we’re seeing.

We are now evaluating a variety of ways to improve things while keeping thread instrumentation enabled, and we will share our findings and improvement PRs here as they become available.

If anyone else has a repro that they can share that does not include any sensitive data, we’d also be happy to test the fix against any* repro shared with us. Otherwise, we’ll have a feature branch prepped that others can pull down and test with.

fallwith on Apr 4, 2023

Thanks @fallwith.

Alright, 3 nights of running Resque without thread tracing and we didn’t have a single hanging process. That’s the first time in ~3 weeks I can say that so I feel pretty confident this was our culprit. Hard to imagine it’s a coincidence at this point but I will reply back in case I see it happen again.

To add some extra details about our ENV:

New Relic Ruby Agent 9.0.0
Ruby 3.2.1
Resque 2.5.0
resque-multi-job-forks

And to re-iterate, it seems to affect long running jobs and not short quick ones. Long running for us means ~2-3 hours.

travisbell on Mar 27, 2023