songbird: Suspected busy-wait on a Mutex blocking bot

An issue I keep running into.

Information:

  • serenity 0.10.5 next
  • songbird 0.1.6 next
  • tokio 1.5.0
  • Rustc 1.51.0
  • glibc 2.31
  • Ubuntu 20.04 Linux 5.4.0-70-generic

What happens:

After running for a while, the bot goes offline and begins to consume the maximum CPU allotted to it. Using perf top, this consumption can be traced back to a mutex and a number of other calls (one of them tokio's Sleep::new_timeout):

[perf top screenshots]
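For anyone skimming, this is roughly the pattern I suspect from that profile. It's a hand-written, hypothetical sketch with made-up names, not code from songbird or my bot, just to show why a loop like this would show up in perf under both the mutex and tokio's timer internals:

```rust
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::{sleep_until, Instant};

// Hypothetical sketch, not songbird's code: if `deadline` never moves forward,
// every iteration locks the mutex, builds a fresh Sleep (Sleep::new_timeout
// under the hood), and the sleep completes immediately, so the task spins at
// full CPU while perf attributes the time to the mutex and timer calls.
async fn spinning_loop(state: Mutex<u64>) {
    let deadline = Instant::now(); // already elapsed by the first await
    loop {
        let mut guard = state.lock().await; // the mutex perf sees
        *guard += 1;
        drop(guard);
        sleep_until(deadline).await; // resolves instantly once `deadline` is in the past
    }
}

#[tokio::main]
async fn main() {
    // Run the spinning loop briefly just to observe the CPU/allocation churn.
    let _ = tokio::time::timeout(Duration::from_secs(1), spinning_loop(Mutex::new(0))).await;
}
```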

Memory usage also spikes a little, but there appears to be no active leak occurring.

The bot is in around 8’000 guilds now and runs autosharded. It usually goes down once every 12-36 hours. The server it is running on has more than enough memory/CPU available to the bot (average total system load 0.5-2.0 out of 12.0) and only runs other Discord bots and a couple of websites.

My bot, which is hitting this, is at https://github.com/jellywx/soundfx-rs

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 25 (21 by maintainers)

Most upvoted comments

…A full week now with no issues, looking very good. Thank you very much to both @jtscuba for the far superior debugging and @FelixMcFelix for sorting out a fix for this and for having patience with the little information I could gather. Much appreciated!

I’ve also been hitting this issue and might be able to shed some light on it.

Information

  • serenity 0.10.5 next
  • songbird 0.2.0-beta.2 next
  • tokio 0.2.25
  • Rustc 1.52.1
  • MacOS and Ubuntu

What Happens

It varies a little by environment, but I'll focus on my Mac since that's where I'm able to gather the most info. The memory pressure on my system spikes, eventually causing it to freeze if I don't kill the bot fast enough.

What I learned

While it was running on my live server I noticed the process was switching between the D and R states, and I was able to use strace to figure out what was going on. In the 1.5 seconds I traced it for, it allocated over 600 MB of memory without its usage exceeding the 180 MB memory limit. Eventually it did hit that limit, at which point the OOM killer restarted the process.

Now that I knew it was an allocation issue, I used the wonderful cargo instruments package with the macOS Instruments tool to profile the allocations while running it locally. The resulting trace from my first successful capture (after freezing my computer a few times) revealed the issue was a 384-byte allocation being made and freed extremely fast. If you want to see the trace, let me know; I'll have to find a place to upload it since it's over 4 GB, though.

Compiling in debug mode, I was able to get a useful stack trace, and since I knew the allocation size I was able to filter out irrelevant allocations. The problematic ones came from an __rdl_alloc primitive called from somewhere in tokio, and I was able to figure out that the only place that many bytes are allocated on a regular basis via __rdl_alloc is here.

[Screenshot: allocation stack trace (Screen Shot 2021-05-30 at 12 52 39 PM)]

Searching through the Songbird code, the only place where a delay_until call is made (renamed to sleep_until in tokio 1.x) is here: https://github.com/serenity-rs/songbird/blob/7e47d8bcc95d00cad4109dd48bb4e43e6a7e9477/src/driver/tasks/ws.rs#L72

You'll notice that this call happens inside a loop.

My hypothesis is that one of those branches is somehow triggering the loop to run with very little to no delay, causing the memory and CPU issue.
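To make that concrete, here's a minimal, hypothetical reduction of the failure mode I have in mind. The names are mine and this is not the actual ws.rs code, just the same general shape: a select! that rebuilds a Sleep from a stored deadline on every pass.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

// Hypothetical reduction of the suspected failure mode; not songbird's ws.rs.
async fn ws_like_loop(mut rx: mpsc::Receiver<()>, interval: Duration) {
    let mut next_heartbeat = Instant::now() + interval;
    loop {
        tokio::select! {
            _ = sleep_until(next_heartbeat) => {
                // If this update is ever skipped (or `interval` is zero),
                // `next_heartbeat` stays in the past: each iteration then
                // builds a new Sleep (the Sleep::new_timeout allocation from
                // the trace), which resolves immediately, and the loop spins.
                next_heartbeat = Instant::now() + interval;
            }
            msg = rx.recv() => {
                if msg.is_none() {
                    break; // sender dropped; shut the task down
                }
            }
        }
    }
}
```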

Next steps

Hopefully this is enough detail to chase down the problem on your side. If not, I think the next move would be to add a bunch of logging/tracing to figure out which path in that loop is causing the issue and wait for it to happen again. I don't understand enough of what's going on in that loop to spot a possible problematic path, but it looks very stateful, so logging might be needed anyway.
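For what it's worth, the kind of instrumentation I mean is just an event per branch. This is a sketch against the reduction above, to splice into the real loop; the field names are made up, and I'm assuming the tracing crate, which I believe songbird already uses:

```rust
// Sketch of the extra instrumentation (hypothetical field names), one event
// per select! branch so a spinning path shows up clearly in the logs.
tokio::select! {
    _ = sleep_until(next_heartbeat) => {
        tracing::trace!(?next_heartbeat, "ws loop: heartbeat branch fired");
        // ... existing heartbeat handling ...
    }
    msg = rx.recv() => {
        tracing::trace!(got_message = msg.is_some(), "ws loop: message branch fired");
        // ... existing message handling ...
    }
}
```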

Final thoughts

  1. Reproducing seems to depend on something fairly rare happening on the Discord side, which means I've had times where it failed in a few minutes and times where it ran with no issues for several days.
  2. Running in release mode seems to make the issue worse, or at least I never ran into it until I ran locally in release mode.
  3. @FelixMcFelix thanks for all your hard work maintaining the serenity/songbird libraries. I know how much of a pain it can be maintaining this stuff and appreciate your efforts.