bandwhich: Bandwhich appears to have a serious memory leak
I built and ran bandwhich on a few machines. After running for a while, it dies. I see this in my logs:
[Sat Sep 9 14:11:34 2023] [ 910813] 0 910813 25425140 6316768 76255232 3166464 200 bandwhich
[Sat Sep 9 14:11:34 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=zerotier-one.service,mems_allowed=0,global_oom,task_memcg=/system.slice/snap.bandwhich.bandwhich-92de3b50-fb9a-430c-9645-eac2cde77fe0.scope,task=bandwhich,pid=910813,uid=0
[Sat Sep 9 14:11:34 2023] Out of memory: Killed process 910813 (bandwhich) total-vm:101700560kB, anon-rss:25267072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:74468kB oom_score_adj:200
[Sat Sep 9 14:11:37 2023] oom_reaper: reaped process 910813 (bandwhich), now anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
While I’ve seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB of swap.
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 39
Commits related to this issue
- Bump `rustix` dependencies - Fixes #284 - See https://github.com/bytecodealliance/rustix/security/advisories/GHSA-c827-hfw6-qwvm — committed to imsnif/bandwhich by cyqsimon 9 months ago
- Release 0.2.1 Includes dependency update to fix security issue: https://github.com/imsnif/bandwhich/issues/284#issuecomment-1754321993 — committed to TheRealLorenz/dotlink by TheRealLorenz 8 months ago
Good news: rustix has backported this fix to all affected minor versions, going back to 0.35 (the first minor version affected). This means we are not blocked on an update from procfs (I have notified them nonetheless).
Since I am pretty confident that we’ve fixed the exact problem, I will close this issue as completed. Please reopen and/or comment in case you are still seeing it on `main`, thanks.

Hi all, quick update.
Thanks to the collaborative effort of @sigmaSd, @konnorandrews, and @sunfishcode (rustix maintainer), we were able to diagnose the issue and reproduce the fault condition in rustix directly. They will likely patch it in the next release.
Since we currently depend on procfs 0.15.1, which depends on an older, incompatible version of rustix, we will not be able to update our indirect dependency on rustix immediately. I will notify @eminence (procfs maintainer) as soon as the rustix patch is available.
Hi all, I went hunting for the bug after cyqsimon brought it up in the community discord server. Looks like I have found the issue and it is in rustix’s linux_raw backend.
As cyqsimon already pointed out, the `Dir` struct for the linux_raw backend contains a `Vec<u8>`: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L21. From the heap trace, the allocation is coming from the `Dir::read()` method here: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L55.
At https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L78, the code in `read()` checks whether it has run out of items and needs to read more; if so, it calls `Dir::read_more()`. Immediately upon entering `Dir::read_more()`, it calls `resize()` on `self.buf` (https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L142), but the new size is computed from the current capacity. This causes `self.buf` to grow without bound every time `read_more()` is called, because of the use of `self.buf.capacity()`. I believe the issue is resolved by simply changing `self.buf.capacity()` to `self.buf.len()`, since directly afterwards `self.buf.resize(nread, 0);` is called to truncate the Vec to the actual number of elements (note this does not deallocate anything).
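To make the growth concrete, here is a minimal, self-contained sketch (my own illustration, not the rustix source; `CHUNK` stands in for `32 * size_of::<linux_dirent64>()`) of why resizing from `capacity()` compounds on every call while resizing from `len()` would not:

```rust
// Illustrative sketch only; not the rustix implementation.
fn main() {
    // Stand-in for `32 * size_of::<linux_dirent64>()` in the real code.
    const CHUNK: usize = 8 * 1024;

    let mut buf: Vec<u8> = Vec::new();
    for i in 0..5 {
        // Buggy pattern: the new size is based on the ever-growing capacity,
        // so every call requests a strictly larger buffer than the last.
        buf.resize(buf.capacity() + CHUNK, 0);

        // Stand-in for `self.buf.resize(nread, 0)`: the length is truncated
        // to the bytes actually read, but the capacity is never released.
        buf.resize(0, 0);

        println!("after call {}: capacity = {} bytes", i + 1, buf.capacity());
    }
    // With the suggested fix, `buf.resize(buf.len() + CHUNK, 0)`, the buffer
    // stays at a few CHUNKs at most, because `len()` is reset by the
    // truncation above while `capacity()` never is.
}
```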
In normal operation, `self.buf.capacity()` should never need to be more than a few multiples of `32 * size_of::<linux_dirent64>()`, if not exactly that.

Actually, speaking of `glibc`, can everyone also try the musl build from release v0.21.0? It’s circumstantial evidence at best, but it might give us some clues.

Here’s a flamegraph for when bandwhich ends up taking up too much memory:
OK, trying that build didn’t work either. I don’t tend to sit and watch it; I leave it running and look at the terminal when I remember. I just looked, and it’s using 70GB after 25 minutes.
@popey please try @cyqsimon’s changes: https://github.com/imsnif/bandwhich/issues/284#issuecomment-1752113867. Also, you don’t need to wait for it to OOM; if it reaches, let’s say, 500MB, we can consider the bug still present.
Left the MUSL build running for 2.5 hours, and no OOM crash yet. Will leave it overnight.
Okay, can everyone please try the `bump-procfs-rustix` branch? You can either pull the branch and build locally, or use the CI build.
If this still does not fix the issue, I will go ahead and submit bug reports in the `procfs` and `rustix` repositories. And in the absolute worst-case scenario, we may have actually found a bug in the `glibc` allocator 🥶.

Incorrect diagnosis. See my newer comment.
Good good. From the backtrace I was able to track the allocations back to a single line.
https://github.com/imsnif/bandwhich/blob/8c6be282a530b14f80854766275f6a27f7dfa31e/src/os/linux.rs#L21
It seems like your hypothesis is correct, @imsnif, albeit with a different hashmap. I’ll take a look at how this stored data is used and how best to implement some sort of cleanup.
Edit:
Perhaps I shouldn’t rule out the possibility that it’s caused by file handles not being closed. I’ll make sure that isn’t the case first.
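For illustration, here is one hedged sketch of the kind of cleanup alluded to above, assuming the data lives in a `HashMap` that only ever gains entries as connections come and go; the `Entry` type, the `u64` key, and the `retain`-based pruning are my own stand-ins, not bandwhich’s actual data structures:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical cache entry; the real map stores different per-connection data.
struct Entry {
    last_seen: Instant,
    // ... per-connection data ...
}

/// Drop entries that have not been refreshed recently, so the map cannot
/// grow without bound in a long-running process.
fn prune(cache: &mut HashMap<u64, Entry>, max_age: Duration) {
    let now = Instant::now();
    cache.retain(|_key, entry| now.duration_since(entry.last_seen) <= max_age);
}

fn main() {
    let mut cache: HashMap<u64, Entry> = HashMap::new();
    cache.insert(42, Entry { last_seen: Instant::now() });
    // Called periodically, e.g. once per refresh cycle.
    prune(&mut cache, Duration::from_secs(60));
    println!("entries after prune: {}", cache.len());
}
```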