bandwhich: Bandwhich appears to have a serious memory leak

I built and ran bandwhich on a few machines. After running for a while, it dies. I see this in my logs:

[Sat Sep  9 14:11:34 2023] [ 910813]     0 910813 25425140  6316768 76255232  3166464           200 bandwhich
[Sat Sep  9 14:11:34 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=zerotier-one.service,mems_allowed=0,global_oom,task_memcg=/system.slice/snap.bandwhich.bandwhich-92de3b50-fb9a-430c-9645-eac2cde77fe0.scope,task=bandwhich,pid=910813,uid=0
[Sat Sep  9 14:11:34 2023] Out of memory: Killed process 910813 (bandwhich) total-vm:101700560kB, anon-rss:25267072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:74468kB oom_score_adj:200
[Sat Sep  9 14:11:37 2023] oom_reaper: reaped process 910813 (bandwhich), now anon-rss:88kB, file-rss:0kB, shmem-rss:0kB

While I’ve seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB swap.

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 39

Most upvoted comments

Good news: rustix has backported this fix to all affected minor versions, going back to 0.35 (the first minor version affected). This means we are not blocked on an update from procfs (I have notified them nonetheless).

Since I am pretty confident that we’ve fixed the exact problem, I will close this issue as completed. Please reopen and/or comment in case you are still seeing it on main, thanks.

Hi all, quick update.

Thanks to the collaborative effort of @sigmaSd, @konnorandrews, and @sunfishcode (rustix maintainer), we were able to diagnose the issue and reproduce the fault condition in rustix directly. They will likely patch it in the next release.

Since we currently depend on procfs 0.15.1, which depends on an older, incompatible version of rustix, we will not be able to update our indirect dependency on rustix immediately. I will notify @eminence (procfs maintainer) as soon as the rustix patch is available.

Hi all, I went hunting for the bug after cyqsimon brought it up in the community Discord server. Looks like I have found the issue, and it is in rustix’s linux_raw backend.

As cyqsimon already pointed out, the Dir struct for the linux_raw backend contains a Vec<u8>: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L21.

From the heaptrace, the allocation is coming from the Dir::read() method here: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L55

At https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L78, the code in read() checks whether it has run out of items and needs to read more; if so, it calls Dir::read_more(). Immediately upon entering Dir::read_more(), it calls resize() on self.buf (https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L142). However, it uses this new size:

self.buf.resize(self.buf.capacity() + 32 * size_of::<linux_dirent64>(), 0);

This causes self.buf to grow without bound every time read_more() is called, because of the use of self.buf.capacity(). I believe the issue is resolved by simply changing self.buf.capacity() to self.buf.len(), since directly after this, self.buf.resize(nread, 0); is called to truncate the Vec to the actual number of elements (note that this does not deallocate anything).
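In other words, the suggested change would presumably look something like this (a sketch of the fix described above, not a tested patch against rustix):

self.buf.resize(self.buf.len() + 32 * size_of::<linux_dirent64>(), 0);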

In normal operation, self.buf.capacity() should never need to be more than a few multiples of 32 * size_of::<linux_dirent64>(), if not exactly that.
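To see why growing relative to the capacity compounds, here is a minimal, self-contained sketch (not the actual rustix code; the chunk size and per-call read size are made up) that mimics the resize pattern above:

// Stand-in for 32 * size_of::<linux_dirent64>(); the exact value does not matter here.
const CHUNK: usize = 32 * 24;

// Mimics the buggy pattern in Dir::read_more(): grow relative to the current
// *capacity*, then shrink the length back down to what was actually read.
// The second resize resets len but never releases capacity, so the next call
// starts from an even larger capacity.
fn read_more_like(buf: &mut Vec<u8>, nread: usize) {
    buf.resize(buf.capacity() + CHUNK, 0);
    buf.resize(nread, 0);
}

fn main() {
    let mut buf = Vec::new();
    for i in 1..=5 {
        read_more_like(&mut buf, 100); // pretend each call read 100 bytes
        println!("call {i}: len = {}, capacity = {}", buf.len(), buf.capacity());
    }
    // The capacity climbs on every call (and, given Vec's amortized growth
    // strategy, can even grow geometrically), whereas growing relative to
    // buf.len() would keep it at roughly CHUNK.
}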

Actually, speaking of glibc, could everyone also try the MUSL build from release v0.21.0? It’s circumstantial evidence at best, but it might give us some clues.

Here’s a flamegraph for when bandwhich ends up taking up too much memory:

[flamegraph image]

OK, trying that build didn’t work either. I don’t tend to sit and watch it; I leave it running and look at the terminal when I remember. I just looked, and it’s using 70GB after 25 minutes.

python3 bin/ps_mem.py | grep bandwhich                                                                                                              nuc: Mon Oct  9 13:47:48 2023

 23.4 GiB +  46.8 GiB =  70.2 GiB       bandwhich

@popey please try @cyqsimon’s changes: https://github.com/imsnif/bandwhich/issues/284#issuecomment-1752113867. Also, you don’t need to wait for it to OOM; if it reaches, let’s say, 500 MB, we can consider the bug still present.

Left the MUSL build running for 2.5 hours, and no OOM crash yet. Will leave it overnight.

Okay, can everyone please try the bump-procfs-rustix branch?

You can either pull the branch and build locally or use the CI build.

If this still does not fix the issue, I will go ahead and submit bug reports in the procfs and rustix repositories. And in the absolute worst-case scenario, we may have actually found a bug in the glibc allocator 🥶.

Incorrect diagnosis. See my newer comment.

Good good. From the backtrace I was able to track the allocations back to a single line.

https://github.com/imsnif/bandwhich/blob/8c6be282a530b14f80854766275f6a27f7dfa31e/src/os/linux.rs#L21

It seems like your hypothesis is correct, @imsnif, albeit concerning a different hashmap. I’ll take a look at how this stored data is used and how best to implement some sort of cleanup (a rough sketch of what that could look like is at the end of this comment).


Edit:

Perhaps I shouldn’t rule out the possibility that it’s caused by file handles not being closed. I’ll make sure that isn’t the case first.
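To illustrate the kind of cleanup meant above, here is a minimal, purely hypothetical sketch. The type, field names, and key format are invented for illustration and are not bandwhich’s actual code; it simply assumes a HashMap that accumulates entries across refreshes and should instead drop stale ones:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical map from some connection key to the owning process name,
// with a "last seen" timestamp so stale entries can be pruned.
struct ConnectionNames {
    entries: HashMap<String, (String, Instant)>,
    max_age: Duration,
}

impl ConnectionNames {
    fn insert(&mut self, key: String, proc_name: String) {
        self.entries.insert(key, (proc_name, Instant::now()));
    }

    // Drop entries that have not been refreshed recently, so the map's size
    // tracks the number of live connections instead of growing forever.
    fn prune(&mut self) {
        let max_age = self.max_age;
        self.entries
            .retain(|_, (_, last_seen)| last_seen.elapsed() < max_age);
    }
}

fn main() {
    let mut names = ConnectionNames {
        entries: HashMap::new(),
        max_age: Duration::from_secs(30),
    };
    names.insert("127.0.0.1:443".into(), "firefox".into());
    names.prune(); // entries seen within the last 30 seconds are kept, older ones dropped
}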