songbird: Suspected busy-wait on a Mutex blocking bot
An issue I keep running into.
Information:
- serenity 0.10.5
- songbird 0.1.6
- tokio 1.5.0
- Rustc 1.51.0
- glibc 2.31
- Ubuntu 20.04 Linux 5.4.0-70-generic
What happens:
After running for a while, the bot goes offline and begins to consume the maximum CPU allotted to it. Using `perf top`, this consumption can be traced back to a mutex, along with a number of other calls (one called `Sleep::new_timeout`, from tokio).
The memory also spikes a little bit, but there appears to be no active leaking occurring.
The bot is in around 8,000 guilds now and runs autosharded. It usually goes down once every 12-36 hours. The server it is running on has more than enough memory/CPU available to the bot (average total system load 0.5-2.0 out of 12.0) and only runs other Discord bots and a couple of websites.
My bot that is having this is at https://github.com/jellywx/soundfx-rs
About this issue
- State: closed
- Created 3 years ago
- Comments: 25 (21 by maintainers)
Commits related to this issue
- Gateway: Add debug logging around shard handling Adds some additional logging around some critical sections, rarely hit (i.e., during shard reconnections) in pursuit of issue #69. It's strongly suspe... — committed to FelixMcFelix/songbird by FelixMcFelix 3 years ago
- Driver: Fix for busy-wait in WS thread. Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-less rec... — committed to FelixMcFelix/songbird by FelixMcFelix 3 years ago
- Driver: Fix for busy-wait in WS thread. (#78) Debugging work put forth by JellyWx and jtscuba suggests that this WS thread receive is ending up in a permanent failure loop. This adapts the timeout-le... — committed to serenity-rs/songbird by FelixMcFelix 3 years ago
…A full week now with no issues, looking very good. Thank you very much to both @jtscuba for the far superior debugging and @FelixMcFelix for sorting out a fix for this and having patience with the little information I could gather. Much appreciated!
I’ve also been hitting this issue and might be able to shed some light on it.
Information
What Happens
It varies a little by environment, but I'll focus on my Mac since that's where I'm able to gather the most info. The memory pressure on my system spikes, eventually causing it to freeze if I don't kill the bot fast enough.
What I learned
While it was running on my live server, I noticed the process was switching between the `D` and `R` states and was able to use `strace` to figure out what was going on. In the 1.5 seconds I traced it for, it allocated over 600 MB of memory without exceeding the memory limit of 180 MB. Then it finally did hit that limit, at which point the OOM killer restarted the process.
Now that I knew it was an allocation issue, I used the wonderful cargo-instruments package with the macOS Instruments tool to profile the allocations while running it locally. The resulting trace from my first successful capture (after freezing my computer a few times) revealed the issue was an allocation of 384 bytes being allocated and freed super fast. If you want to see the trace, let me know; I'll have to find a place to upload it, since it's over 4 GB though.
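As an aside, a counting global allocator is a cheap way to confirm this kind of churn without Instruments. The following is only an illustrative sketch, not code from the bot; the 384-byte constant is just the size from the trace above, and all the names are made up:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

// Size of the suspect allocation seen in the profiling trace.
const SUSPECT_SIZE: usize = 384;
static SUSPECT_COUNT: AtomicU64 = AtomicU64::new(0);

// Wraps the system allocator and counts every allocation of SUSPECT_SIZE bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if layout.size() == SUSPECT_SIZE {
            SUSPECT_COUNT.fetch_add(1, Ordering::Relaxed);
        }
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    // In the real bot you would log this counter periodically; a healthy
    // process should not see it climb by thousands per second.
    let before = SUSPECT_COUNT.load(Ordering::Relaxed);
    let _buf = vec![0u8; SUSPECT_SIZE]; // one 384-byte allocation for the demo
    let after = SUSPECT_COUNT.load(Ordering::Relaxed);
    println!("suspect-sized allocations during the demo: {}", after - before);
}
```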
Compiling in debug mode, I was able to get a useful stack trace, and since I knew the allocation size I was able to filter out irrelevant allocations. The problematic ones came from an `__rdl_alloc` primitive called from somewhere in tokio, and I was able to figure out that the only place that many bytes are allocated on a regular basis through the `__rdl_alloc` primitive is here.
Searching through the Songbird code, the only place where a `delay_until` call is made (`sleep_until` in tokio 1.x) is here: https://github.com/serenity-rs/songbird/blob/7e47d8bcc95d00cad4109dd48bb4e43e6a7e9477/src/driver/tasks/ws.rs#L72
Which, you'll notice, happens inside a `loop`.
My hypothesis is that somehow one of those branches is triggering the loop to run with very little to no delay, causing the memory and CPU issues.
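To make that hypothesis concrete, here is a minimal, self-contained sketch of the failure shape I mean. This is not songbird's actual `ws.rs` code; the channel and the names are purely illustrative. The point is that if one `select!` arm keeps completing immediately (e.g. because its source has failed permanently) and the handler just loops again, the sleep is never awaited, so it is dropped and rebuilt on every pass, burning CPU and churning allocations:

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::{sleep_until, Instant}};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(16);
    drop(tx); // simulate a permanently failed source: recv() now returns None immediately

    let mut deadline = Instant::now() + Duration::from_millis(500);
    let mut spins = 0u64;

    loop {
        tokio::select! {
            msg = rx.recv() => match msg {
                Some(m) => println!("got {}", m),
                // If this arm neither breaks out nor backs off, the loop is
                // re-entered immediately and the Sleep below is dropped and
                // recreated every single iteration: a busy-wait.
                None => {
                    spins += 1;
                    if spins > 1_000_000 {
                        println!("spun {} times without ever sleeping", spins);
                        break;
                    }
                }
            },
            _ = sleep_until(deadline) => {
                // Periodic housekeeping; push the deadline forward.
                deadline = Instant::now() + Duration::from_millis(500);
            }
        }
    }
}
```

The commit messages above suggest the eventual fix adapted the timeout-less WS receive so that a permanent failure can't keep winning this race, which is the same idea as breaking out or backing off in the `None` arm here.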
Next steps
Hopefully this is enough detail to chase down the problem on your side. If not, I think the next move would be to add a bunch of logging/tracing to figure out which path in that loop is causing the issue, and wait for it to happen again. I don't understand enough of what's going on in that loop to spot a possible problematic path, but it looks super stateful, so logging might be needed anyway.
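For reference, here is a sketch of what that per-branch logging could look like, using the `tracing` crate (which I believe serenity and songbird already log through). The loop shape, messages, and names are made up for illustration rather than taken from `ws.rs`:

```rust
use std::time::Duration;
use tokio::{sync::mpsc, time::{sleep_until, Instant}};
use tracing::debug;

// Illustrative only: a loop shaped roughly like the one in ws.rs, with a
// debug! line on every branch so the spinning path shows up in the logs.
async fn ws_style_loop(mut rx: mpsc::Receiver<u32>) {
    let mut deadline = Instant::now() + Duration::from_millis(500);
    loop {
        tokio::select! {
            msg = rx.recv() => {
                debug!(?msg, "ws loop: message branch");
                if msg.is_none() {
                    debug!("ws loop: receiver closed, exiting");
                    break;
                }
            }
            _ = sleep_until(deadline) => {
                debug!("ws loop: timer branch");
                deadline = Instant::now() + Duration::from_millis(500);
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Emit debug-level events so the branch counts are visible.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();

    let (tx, rx) = mpsc::channel::<u32>(16);
    tx.send(42).await.unwrap();
    drop(tx);
    ws_style_loop(rx).await;
}
```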
Final thoughts