go: testing: apparent memory leak in fuzzer

If I leave a fuzzer running for sufficiently long, then it crashes with an OOM. Snippet from dmesg:

[1974087.246830] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-319973.slice/session-c1.scope,task=json.test,pid=3659815,uid=319973
[1974087.264733] Out of memory: Killed process 3659815 (json.test) total-vm:18973836kB, anon-rss:13185376kB, file-rss:0kB, shmem-rss:0kB, UID:319973 pgtables:33988kB oom_score_adj:0
[1974087.971181] oom_reaper: reaped process 3659815 (json.test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I don’t believe this is because the code being tested is OOMing; rather, the fuzzer itself appears to be retaining too much memory.

Here’s a graph of the RSS memory usage over time: [graph not included]. The machine has 32 GiB of RAM.

There are large jumps in memory usage at various intervals. I don’t have much understanding of how the fuzzer works, but perhaps this is the mutator discovering that some input expands coverage and adding it to an internal data structure?
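
Conceptually, the kind of growth I’m imagining is a corpus that only ever grows as new coverage is found. A rough sketch of that idea (this is just an illustration; none of the names below come from the actual Go fuzzer) would be:

    package fuzzsketch

    // corpus illustrates the hypothesized behavior: every input that produces a
    // previously unseen coverage signature is retained in memory forever.
    type corpus struct {
        entries [][]byte        // retained inputs; grows but never shrinks
        seen    map[uint64]bool // coverage signatures observed so far
    }

    func newCorpus() *corpus {
        return &corpus{seen: make(map[uint64]bool)}
    }

    // maybeKeep retains an input the first time its coverage signature is seen.
    // If the target keeps exposing new coverage, memory use grows without bound.
    func (c *corpus) maybeKeep(input []byte, coverageSig uint64) {
        if !c.seen[coverageSig] {
            c.seen[coverageSig] = true
            c.entries = append(c.entries, append([]byte(nil), input...))
        }
    }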

I should further note that when the fuzzer crashes, it produces a testdata/FuzzXXX/YYY file as the “reproducer”. Running the test with that “reproducer” does not fail the fuzz test. If possible, the fuzzer should distinguish between OOMs caused by itself and OOMs caused by the code being tested: the former should not result in any “repro” corpus files being added, while the latter should.
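
(For reference, re-running the “reproducer” amounts to something like the following, where FuzzXXX/YYY stands in for the real target and corpus-file names:)

go test -run=FuzzXXX/YYY -v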

I’m using commit 5aacd47c002c39b481c4c7a0663e851758da372a. (I can provide the code I’m fuzzing, but please contact me privately.)
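
As a stand-in for discussion, a minimal JSON-decoding fuzz target (not my actual code, and the FuzzOOM name is only inferred from the json.test binary and the -fuzz=OOM invocation below) looks like:

    package json_test

    import (
        "encoding/json"
        "testing"
    )

    // FuzzOOM is a simplified stand-in for the real fuzz target. Any target
    // that keeps exposing new coverage should do for observing the fuzzer's
    // own memory growth.
    func FuzzOOM(f *testing.F) {
        f.Add([]byte(`{"key":"value"}`))
        f.Fuzz(func(t *testing.T, data []byte) {
            var v any
            // Decoding errors are expected for malformed inputs; the point is
            // only to give the mutator coverage to chase.
            _ = json.Unmarshal(data, &v)
        })
    }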

/cc @katiehockman @jayconrod

Most upvoted comments

@dsnet, can you still reproduce the problem?

As of go version devel go1.18-3bbc82371e, this is still an issue. I run it with:

go.tip test -fuzz=OOM -v -parallel=1

Since each individual fuzz worker has its own namespace, running with -parallel=1 seems to manifest the problem fastest. On my Ryzen 5900X, it was allocating roughly 1 GiB/minute.
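
To watch the growth while the fuzzer runs, sampling the test binary’s RSS should work (this assumes Linux procps; json.test is the process name from the dmesg snippet above, and it will list the coordinator and worker processes together):

while sleep 60; do ps -C json.test -o rss=,comm=; done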