songbird: Suspected busy-wait on a Mutex blocking bot
An issue I keep running into.
Information:
- serenity 0.10.5
- songbird 0.1.6
- tokio 1.5.0
- Rustc 1.51.0
- glibc 2.31
- Ubuntu 20.04 Linux 5.4.0-70-generic
What happens:
After running for a while, the bot goes offline and begins to consume the maximum CPU allotted to it. Using `perf top`, this consumption can be traced back to a mutex, along with a number of other calls (one called `Sleep::new_timeout`, from tokio).
The memory also spikes a little bit, but there appears to be no active leaking occurring.
The bot is in around 8,000 guilds now and runs autosharded. It usually goes down once every 12-36 hours. The server it is running on has more than enough memory/CPU available to the bot (average total system load 0.5-2.0 out of 12.0) and only runs other Discord bots and a couple of websites.
My bot that is having this is at https://github.com/jellywx/soundfx-rs
About this issue
- State: closed
- Created 3 years ago
- Comments: 25 (21 by maintainers)
Commits related to this issue
- Gateway: Add debug logging around shard handling Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue #69. It's strongly suspe... — committed to FelixMcFelix/songbird by FelixMcFelix 3 years ago
- Driver: Fix for busy-wait in WS thread. Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-less rec... — committed to FelixMcFelix/songbird by FelixMcFelix 3 years ago
- Driver: Fix for busy-wait in WS thread. (#78) Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-le... — committed to serenity-rs/songbird by FelixMcFelix 3 years ago
…A full week now with no issues, looking very good. Thank you very much to both @jtscuba for the far superior debugging and @FelixMcFelix for sorting out a fix for this and having patience with the little information I could gather. Much appreciated!
I’ve also been hitting this issue and might be able to shed some light on it.
Information
What Happens
It varies a little by environment, but I'll focus on my Mac since that's where I'm able to gather the most info. The memory pressure on my system spikes, eventually causing it to freeze if I don't kill the bot fast enough.
What I learned
While it was running on my live server, I noticed the process was switching between the `D` and `R` states and was able to use `strace` to figure out what was going on. In the 1.5 seconds I traced it for, it allocated over 600 MB of memory without exceeding the memory limit of 180 MB. Then it finally did hit that limit, at which point the OOM killer restarted the process.
Now that I knew it was an allocation issue, I used the wonderful cargo-instruments package with the macOS Instruments tool to profile the allocations while running it locally. The resulting trace from my first successful capture (after freezing my computer a few times) revealed the issue was an allocation of 384 bytes being allocated and freed super fast. If you want to see the trace, let me know; I'll have to find a place to upload it, since it's over 4 GB though.
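As an aside, a counting global allocator is a cheap way to confirm this kind of churn without Instruments. The following is only an illustrative sketch, not code from the bot; the 384-byte constant is just the size from the trace above, and all the names are made up:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

// Size of the suspect allocation seen in the profiling trace.
const SUSPECT_SIZE: usize = 384;
static SUSPECT_COUNT: AtomicU64 = AtomicU64::new(0);

// Wraps the system allocator and counts every allocation of SUSPECT_SIZE bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if layout.size() == SUSPECT_SIZE {
            SUSPECT_COUNT.fetch_add(1, Ordering::Relaxed);
        }
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    // In the real bot you would log this counter periodically; a healthy
    // process should not see it climb by thousands per second.
    let before = SUSPECT_COUNT.load(Ordering::Relaxed);
    let _buf = vec![0u8; SUSPECT_SIZE]; // one 384-byte allocation for the demo
    let after = SUSPECT_COUNT.load(Ordering::Relaxed);
    println!("suspect-sized allocations during the demo: {}", after - before);
}
```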
Compiling in debug mode, I was able to get a useful stack trace, and since I knew the allocation size I was able to filter out irrelevant allocations. The problematic ones came from an `__rdl_alloc` primitive called from somewhere in tokio, and I was able to figure out that the only place that many bytes are allocated on a regular basis through the `__rdl_alloc` primitive is here.
Searching through the Songbird code, the only place where a `delay_until` call is made (`sleep_until` in tokio 1.x) is here: https://github.com/serenity-rs/songbird/blob/7e47d8bcc95d00cad4109dd48bb4e43e6a7e9477/src/driver/tasks/ws.rs#L72
Which, you'll notice, happens inside a `loop`.
My hypothesis is that somehow one of those branches is triggering the loop to run with very little to no delay, causing the memory and CPU issues.
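To make that hypothesis concrete, here is a minimal, self-contained sketch of the failure shape I mean. This is not songbird's actual `ws.rs` code; the channel and the names are purely illustrative. The point is that if one `select!` arm keeps completing immediately (e.g. because its source has failed permanently) and the handler just loops again, the sleep is never awaited, so it is dropped and rebuilt on every pass, burning CPU and churning allocations:

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::{sleep_until, Instant}};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(16);
    drop(tx); // simulate a permanently failed source: recv() now returns None immediately

    let mut deadline = Instant::now() + Duration::from_millis(500);
    let mut spins = 0u64;

    loop {
        tokio::select! {
            msg = rx.recv() => match msg {
                Some(m) => println!("got {}", m),
                // If this arm neither breaks out nor backs off, the loop is
                // re-entered immediately and the Sleep below is dropped and
                // recreated every single iteration: a busy-wait.
                None => {
                    spins += 1;
                    if spins > 1_000_000 {
                        println!("spun {} times without ever sleeping", spins);
                        break;
                    }
                }
            },
            _ = sleep_until(deadline) => {
                // Periodic housekeeping; push the deadline forward.
                deadline = Instant::now() + Duration::from_millis(500);
            }
        }
    }
}
```

The commit messages above suggest the eventual fix adapted the timeout-less WS receive so that a permanent failure can't keep winning this race, which is the same idea as breaking out or backing off in the `None` arm here.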
Next steps
Hopefully this is enough detail to chase down the problem on your side. If not, I think the next move would be to add a bunch of logging/tracing to figure out which path in that loop is causing the issue, and wait for it to happen again. I don't understand enough of what's going on in that loop to spot a possible problematic path, but it looks super stateful, so logging might be needed anyway.
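For reference, here is a sketch of what that per-branch logging could look like, using the `tracing` crate (which I believe serenity and songbird already log through). The loop shape, messages, and names are made up for illustration rather than taken from `ws.rs`:

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::{sleep_until, Instant}};
use tracing::debug;

// Illustrative only: a loop shaped roughly like the one in ws.rs, with a
// debug! line on every branch so the spinning path shows up in the logs.
async fn ws_style_loop(mut rx: mpsc::Receiver<u32>) {
    let mut deadline = Instant::now() + Duration::from_millis(500);
    loop {
        tokio::select! {
            msg = rx.recv() => {
                debug!(?msg, "ws loop: message branch");
                if msg.is_none() {
                    debug!("ws loop: receiver closed, exiting");
                    break;
                }
            }
            _ = sleep_until(deadline) => {
                debug!("ws loop: timer branch");
                deadline = Instant::now() + Duration::from_millis(500);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Emit debug-level events so the branch counts are visible.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();

    let (tx, rx) = mpsc::channel::<u32>(16);
    tx.send(42).await.unwrap();
    drop(tx);
    ws_style_loop(rx).await;
}
```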
Final thoughts