bandwhich: Bandwhich appears to have a serious memory leak

I built and ran bandwhich on a few machines. After running for a while, it dies. I see this in my logs:

[Sat Sep  9 14:11:34 2023] [ 910813]     0 910813 25425140  6316768 76255232  3166464           200 bandwhich
[Sat Sep  9 14:11:34 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=zerotier-one.service,mems_allowed=0,global_oom,task_memcg=/system.slice/snap.bandwhich.bandwhich-92de3b50-fb9a-430c-9645-eac2cde77fe0.scope,task=bandwhich,pid=910813,uid=0
[Sat Sep  9 14:11:34 2023] Out of memory: Killed process 910813 (bandwhich) total-vm:101700560kB, anon-rss:25267072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:74468kB oom_score_adj:200
[Sat Sep  9 14:11:37 2023] oom_reaper: reaped process 910813 (bandwhich), now anon-rss:88kB, file-rss:0kB, shmem-rss:0kB

While I’ve seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB swap.

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 39

Most upvoted comments

Good news: rustix has backported this fix to all affected minor versions, going back to 0.35 (the first minor version affected). This means we are not blocked on an update from procfs (I have notified them nonetheless).

Since I am pretty confident that we’ve fixed the exact problem, I will close this issue as completed. Please reopen and/or comment in case you are still seeing it on main, thanks.

Hi all, quick update.

Thanks to the collaborative effort of @sigmaSd, @konnorandrews, and @sunfishcode (rustix maintainer), we were able to diagnose the issue and reproduce the fault condition in rustix directly. They will likely patch it in the next release.

Since we currently depend on procfs 0.15.1, which depends on an older, incompatible version of rustix, we will not be able to update our indirect dependency on rustix immediately. I will notify @eminence (procfs maintainer) as soon as the rustix patch is available.

Hi all, I went hunting for the bug after cyqsimon brought it up in the community Discord server. Looks like I have found the issue, and it is in rustix’s linux_raw backend.

As cyqsimon already pointed out, the Dir struct for the linux_raw backend contains a Vec<u8>: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L21.

From the heaptrace, the allocation is coming from the Dir::read() method here: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L55

At https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L78, the code in read() checks whether it has run out of items and needs to read more; if so, it calls Dir::read_more(). Immediately upon entering Dir::read_more(), it calls resize() on self.buf (https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L142). However, it uses this new size:

self.buf.resize(self.buf.capacity() + 32 * size_of::<linux_dirent64>(), 0);

This causes self.buf to grow without bound every time read_more() is called, because of the use of self.buf.capacity(). I believe the issue is resolved by simply changing self.buf.capacity() to self.buf.len(), since directly after this, self.buf.resize(nread, 0); is called to truncate the Vec to the actual number of elements (note that this does not deallocate anything).
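In other words, the suggested change would presumably look something like this (a sketch of the fix described above, not a tested patch against rustix):

self.buf.resize(self.buf.len() + 32 * size_of::<linux_dirent64>(), 0);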

In normal operation, self.buf.capacity() should never need to be more than a few multiples of 32 * size_of::<linux_dirent64>(), if not exactly that.
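To see why growing relative to the capacity compounds, here is a minimal, self-contained sketch (not the actual rustix code; the chunk size and per-call read size are made up) that mimics the resize pattern above:

// Stand-in for 32 * size_of::<linux_dirent64>(); the exact value does not matter here.
const CHUNK: usize = 32 * 24;

// Mimics the buggy pattern in Dir::read_more(): grow relative to the current
// *capacity*, then shrink the length back down to what was actually read.
// The second resize resets len but never releases capacity, so the next call
// starts from an even larger capacity.
fn read_more_like(buf: &mut Vec<u8>, nread: usize) {
    buf.resize(buf.capacity() + CHUNK, 0);
    buf.resize(nread, 0);
}

fn main() {
    let mut buf = Vec::new();
    for i in 1..=5 {
        read_more_like(&mut buf, 100); // pretend each call read 100 bytes
        println!("call {i}: len = {}, capacity = {}", buf.len(), buf.capacity());
    }
    // The capacity climbs on every call (and, given Vec's amortized growth
    // strategy, can even grow geometrically), whereas growing relative to
    // buf.len() would keep it at roughly CHUNK.
}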

Actually, speaking of glibc, could everyone also try the MUSL build from release v0.21.0? It’s circumstantial evidence at best, but it might give us some clues.

Here’s a flamegraph for when bandwhich ends up taking up too much memory:

[flamegraph image]

OK, trying that build didn’t work either. I don’t tend to sit and watch it; I leave it running and look at the terminal when I remember. I just looked, and it’s using 70GB after 25 minutes.

python3 bin/ps_mem.py | grep bandwhich                                                                                                              nuc: Mon Oct  9 13:47:48 2023

 23.4 GiB +  46.8 GiB =  70.2 GiB       bandwhich

@popey please try @cyqsimon’s changes: https://github.com/imsnif/bandwhich/issues/284#issuecomment-1752113867. Also, you don’t need to wait for it to OOM; if it reaches, let’s say, 500 MB, we can consider the bug still present.

Left the MUSL build running for 2.5 hours, and no OOM crash yet. Will leave it overnight.

Okay, can everyone please try the bump-procfs-rustix branch?

You can either pull the branch and build locally or use the CI build.

If this still does not fix the issue, I will go ahead and submit bug reports in the procfs and rustix repositories. And in the absolute worst-case scenario, we may have actually found a bug in the glibc allocator 🥶.

Incorrect diagnosis. See my newer comment.

Good good. From the backtrace I was able to track the allocations back to a single line.

https://github.com/imsnif/bandwhich/blob/8c6be282a530b14f80854766275f6a27f7dfa31e/src/os/linux.rs#L21

It seems like your hypothesis is correct, @imsnif, albeit concerning a different hashmap. I’ll take a look at how this stored data is used and how best to implement some sort of cleanup (a rough sketch of what that could look like is at the end of this comment).


Edit:

Perhaps I shouldn’t rule out the possibility that it’s caused by file handles not being closed. I’ll make sure that isn’t the case first.
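To illustrate the kind of cleanup meant above, here is a minimal, purely hypothetical sketch. The type, field names, and key format are invented for illustration and are not bandwhich’s actual code; it simply assumes a HashMap that accumulates entries across refreshes and should instead drop stale ones:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical map from some connection key to the owning process name,
// with a "last seen" timestamp so stale entries can be pruned.
struct ConnectionNames {
    entries: HashMap<String, (String, Instant)>,
    max_age: Duration,
}

impl ConnectionNames {
    fn insert(&mut self, key: String, proc_name: String) {
        self.entries.insert(key, (proc_name, Instant::now()));
    }

    // Drop entries that have not been refreshed recently, so the map's size
    // tracks the number of live connections instead of growing forever.
    fn prune(&mut self) {
        let max_age = self.max_age;
        self.entries
            .retain(|_, (_, last_seen)| last_seen.elapsed() < max_age);
    }
}

fn main() {
    let mut names = ConnectionNames {
        entries: HashMap::new(),
        max_age: Duration::from_secs(30),
    };
    names.insert("127.0.0.1:443".into(), "firefox".into());
    names.prune(); // entries seen within the last 30 seconds are kept, older ones dropped
}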