dask: Memory leak with Numpy arrays and the threaded scheduler
This example leaks around 500MB of memory on my machine when using the threaded scheduler, and almost no memory when using the single-threaded scheduler:
import dask.array as da
x = da.ones((2e4, 2e4), chunks=(2e4, 100))
y = x.rechunk((100, 2e4))
z = y.rechunk((2e4, 100))
import psutil
proc = psutil.Process()
from distributed.utils import format_bytes
print(format_bytes(proc.memory_info().rss))
# 80MB
from dask.diagnostics import ProgressBar
ProgressBar().register()
# z.sum().compute(scheduler='single-threaded') # This doesn't cause problems
z.sum().compute(scheduler='threads') # This leaks around 500MB of memory
print(format_bytes(proc.memory_info().rss))
# 500-600MB
This doesn’t happen when I run it with the single-threaded scheduler.
Calling gc.collect() doesn’t help. Allocating a new large NumPy array afterwards doesn’t reuse the leaked memory either; the RSS just keeps climbing. Looking at the objects that Python knows about shows that there isn’t much around:
from pympler import muppy
all_objects = muppy.get_objects()
from pympler import summary
sum1 = summary.summarize(all_objects)
summary.print_(sum1)
types | # objects | total size
=========================================== | =========== | ============
<class 'str | 60517 | 8.75 MB
<class 'dict | 11991 | 5.30 MB
<class 'code | 19697 | 2.72 MB
<class 'type | 2228 | 2.24 MB
<class 'tuple | 16142 | 1.04 MB
<class 'set | 2285 | 858.84 KB
<class 'list | 7284 | 738.09 KB
<class 'weakref | 4412 | 344.69 KB
<class 'abc.ABCMeta | 261 | 263.54 KB
function (__init__) | 1378 | 183.02 KB
<class 'traitlets.traitlets.MetaHasTraits | 180 | 175.45 KB
<class 'wrapper_descriptor | 2240 | 175.00 KB
<class 'int | 5584 | 168.92 KB
<class 'getset_descriptor | 2389 | 167.98 KB
<class 'collections.OrderedDict | 292 | 141.00 KB
The local schedulers don’t have any persistent state. My next step is to reproduce with the standard concurrent.futures module, but I thought I’d put this up early in case people have suggestions.
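For reference, a rough sketch of what that concurrent.futures reproduction might look like (the chunk shape, worker count, and iteration count here are arbitrary, not taken from the example above):
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import psutil

proc = psutil.Process()
print("RSS before:", proc.memory_info().rss // 2**20, "MB")

def make_and_drop(_):
    # Allocate a ~16 MB chunk, reduce it, and let it go out of scope.
    chunk = np.ones((2000, 1000))
    return chunk.sum()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(make_and_drop, range(400)))

print("RSS after:", proc.memory_info().rss // 2**20, "MB")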
About this issue
- Original URL
- State: open
- Created 6 years ago
- Comments: 56 (37 by maintainers)
@chakpak I believe the tl;dr on this is that it’s neither Dask’s nor NumPy’s problem, and not something we can safely resolve here, but instead an issue with the underlying C-level memory allocator and its heuristics for when to release memory back to the OS.
Therefore, I’d suggest either adjusting the settings for the glibc allocator if you’re on Linux, or trying a different allocator if you’re on macOS.
Read this for more information: https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os
But a critical thing to note with this issue is that NumPy/Python/Dask isn’t actually using too much memory. There’s no leak.
It’s just a bookkeeping issue. The Python process is hanging onto a bunch of memory it’s not actually using at the moment. But if it needs more memory again, it should re-use that memory without asking the OS for more. The alternative (when you set MALLOC_TRIM_THRESHOLD_=0) is that unused memory gets returned to the OS immediately, but then the process has to ask the OS for memory again if it needs more in the future.
If multiple important processes on the machine are all sharing memory, then having Dask hog it is not ideal. However, if Dask is the only major thing running (say, on a VM in AWS), then it may not actually matter that this unused memory doesn’t get released back to the OS, because nothing else needs it.
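As a concrete illustration of that trade-off, here is a minimal sketch of launching a script with the trim threshold set to 0. glibc reads the variable at process startup, so it has to be set in the environment of the new process; the script name is hypothetical:
import os
import subprocess
import sys

# Run the Dask script in a child process with glibc's trim threshold set to 0,
# so freed heap memory is returned to the OS immediately (Linux/glibc only;
# the trailing underscore in the variable name is intentional).
env = dict(os.environ, MALLOC_TRIM_THRESHOLD_="0")
subprocess.run([sys.executable, "my_dask_script.py"], env=env, check=True)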
Just thought I’d mention that I’ve been observing apparent memory leaks when using Dask array on a Kubernetes cluster: several GB were accumulating in the main process where the client and scheduler are running, despite no large results being computed into memory. Memory usage remained after closing the client and cluster. Clearing the scheduler logs freed a little memory but did not resolve the main leak.
I found that setting MALLOC_MMAP_THRESHOLD_=16384 substantially improves this, i.e., now I get ~200MB where I was getting ~2GB leaking.
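If restarting the process with the environment variable isn’t convenient, the equivalent knob can in principle be poked at runtime through glibc’s mallopt. This is a rough sketch, not a Dask or NumPy API, and it assumes Linux/glibc (M_MMAP_THRESHOLD is -3 in malloc.h):
import ctypes

libc = ctypes.CDLL("libc.so.6")
M_MMAP_THRESHOLD = -3  # parameter constant from glibc's malloc.h

# Ask glibc to serve allocations of 16 KB and above via mmap, so they are
# unmapped (and their memory returned to the OS) as soon as they are freed.
# mallopt returns 1 on success and 0 on failure.
ok = libc.mallopt(M_MMAP_THRESHOLD, 16384)
print("mallopt succeeded:", bool(ok))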
@njsmith
Replacing the allocator at runtime is indeed non-trivial, but this (now outdated) PR did just that: https://github.com/numpy/numpy/pull/5470
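Short of a change like that inside NumPy, the usual way to experiment with a different allocator is to preload it when the process starts rather than swap it at runtime. A sketch assuming jemalloc is installed; the library path and script name are only examples and will differ per system:
import os
import subprocess
import sys

# Preload jemalloc so it intercepts malloc/free for the whole child process.
# The .so path below is typical for Debian/Ubuntu and is only an example.
env = dict(os.environ, LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2")
subprocess.run([sys.executable, "my_dask_script.py"], env=env, check=True)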
@tzickel
Can you try MALLOC_MMAP_THRESHOLD_=16384 and report the results? (Note: the trailing underscore isn’t a typo!)
@gjoseph92 this was a very good explanation. Thanks for following up. This makes sense and we will experiment further.
This is long, apologies! Even single-threaded, a similar problem can be forced (CentOS 7, glibc):
this gives me:
LD_PRELOADing jemalloc and rerunning made no difference to the outcome (obviously mallinfo doesn’t work as allocations are redirected). Switching back to glibc memory allocation, adding in a shim library to trace the malloc/frees with mtrace(3), and then scraping the log shows the first 3 large chunks from the assignment to x. Later, similar entries appear for the smaller chunks when materialisation occurs; these are size 0xf10, and the allocations/frees of size 0xf10 are typically interspersed with other 32- and 8-byte allocations. It should be noted that running the mtrace Perl script on the log shows no leak from NumPy (but a bit from Python).
I’m pondering on the following things:
- Ignoring the mmap/brk threshold split for now: to achieve an artificially high RSS, one could simply interleave large allocations with one-byte allocations, then free the large ones. RSS will be large, but the actual amount in use by current allocations will be small (see the sketch after this list).
- If swapping allocators, anything using the stdlib.h family of allocation functions will need to be compiled against them, else there may be duplicate symbol problems? Obviously the allocation library could be statically linked, but this would potentially hit the noted problems over malloc/free pairs.
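A rough sketch of that first point, assuming glibc defaults on Linux (allocations below the ~128 KB mmap threshold land on the brk heap, and buffers over 512 bytes bypass CPython’s small-object allocator):
import psutil

proc = psutil.Process()
big, small = [], []
for _ in range(2000):
    big.append(bytearray(100_000))  # ~100 KB: below glibc's mmap threshold
    small.append(bytearray(600))    # small block allocated right after it
print("RSS with everything live:", proc.memory_info().rss // 2**20, "MB")

# Free only the large blocks. The small blocks remain scattered through the
# heap, so glibc cannot trim it back down and RSS stays high even though
# very little memory is actually live.
del big
print("RSS after freeing large blocks:", proc.memory_info().rss // 2**20, "MB")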
Memory allocation is not an exact science. The heuristics used by each allocator may backfire in some use cases. When I mentioned jemalloc, I did not want to imply that it would be a silver bullet 😃 Just that it could be interesting to run experiments with it.
Yes, this seems nicer.
It does reliably seem to increase runtimes, though that would be acceptable.