dask: Memory leak with Numpy arrays and the threaded scheduler

This example leaks around 500MB of memory on my machine when using the threaded scheduler, and almost no memory when using the single-threaded scheduler:

import dask.array as da
x = da.ones((2e4, 2e4), chunks=(2e4, 100))
y = x.rechunk((100, 2e4))
z = y.rechunk((2e4, 100))

import psutil
proc = psutil.Process()

from distributed.utils import format_bytes
print(format_bytes(proc.memory_info().rss))
# 80MB

from dask.diagnostics import ProgressBar
ProgressBar().register()

# z.sum().compute(scheduler='single-threaded')  # This doesn't cause problems
z.sum().compute(scheduler='threads')  # This leaks around 500MB of memory

print(format_bytes(proc.memory_info().rss))
# 500-600MB


This doesn’t happen when I run it with the single-threaded scheduler.

Calling gc.collect() doesn’t help. Allocating a new large NumPy array afterwards doesn’t reuse the leaked memory either; the number just keeps climbing. Looking at the objects that Python knows about shows that there isn’t much around:

from pympler import muppy
all_objects = muppy.get_objects()

from pympler import summary
sum1 = summary.summarize(all_objects)
summary.print_(sum1)                          
                                      types |   # objects |   total size
=========================================== | =========== | ============
                                <class 'str |       60517 |      8.75 MB
                               <class 'dict |       11991 |      5.30 MB
                               <class 'code |       19697 |      2.72 MB
                               <class 'type |        2228 |      2.24 MB
                              <class 'tuple |       16142 |      1.04 MB
                                <class 'set |        2285 |    858.84 KB
                               <class 'list |        7284 |    738.09 KB
                            <class 'weakref |        4412 |    344.69 KB
                        <class 'abc.ABCMeta |         261 |    263.54 KB
                        function (__init__) |        1378 |    183.02 KB
  <class 'traitlets.traitlets.MetaHasTraits |         180 |    175.45 KB
                 <class 'wrapper_descriptor |        2240 |    175.00 KB
                                <class 'int |        5584 |    168.92 KB
                  <class 'getset_descriptor |        2389 |    167.98 KB
            <class 'collections.OrderedDict |         292 |    141.00 KB

The local schedulers don’t have any persistent state. My next step is to reproduce with the standard concurrent.futures module, but I thought I’d put this up early in case people have suggestions.
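
For reference, a minimal sketch (not from the original issue; the chunk size, iteration count and thread count are arbitrary) of what that concurrent.futures reproduction might look like: allocate and drop NumPy chunks from a thread pool and compare RSS before and after.

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import psutil


def work(_):
    # Allocate a ~16 MB float64 chunk, reduce it, and let it go out of scope.
    chunk = np.ones(2_000_000)
    return chunk.sum()


proc = psutil.Process()
print("before:", proc.memory_info().rss // 2**20, "MB")

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(work, range(200)))

print("after: ", proc.memory_info().rss // 2**20, "MB")

If RSS stays elevated here as well, that points at the allocator rather than anything Dask-specific.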

cc @shoyer @pitrou @njsmith

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 56 (37 by maintainers)

Most upvoted comments

@chakpak I believe the tl;dr on this is that it’s neither Dask’s nor NumPy’s problem, and not something we can safely resolve here, but instead an issue with the underlying C-level memory allocator and its heuristics for when to release memory back to the OS.

Therefore, I’d suggest either adjusting the settings for the allocator if you’re on Linux:

$ MALLOC_TRIM_THRESHOLD_=0 python my_script.py ...

Or trying a different allocator if you’re on macOS:

$ brew install jemalloc
$ DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python my_script.py ...

Read this for more information: https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os

But a critical thing to note with this issue is that NumPy/Python/Dask isn’t actually using too much memory. There’s no leak.

It’s just a bookkeeping issue. The Python process is hanging onto a bunch of memory it’s not actually using at the moment. But if it needs more memory again, it should re-use that memory without asking the OS for more. The alternative (when you set MALLOC_TRIM_THRESHOLD_=0) is that unused memory gets returned to the OS immediately, but then the process has to ask the OS for memory again if it needs more in the future.

If multiple important processes on the machine are all sharing memory, then having dask hog it is not ideal. However, if dask is the only major thing running (say, on a VM in AWS), then it may not actually matter that this unused memory doesn’t get released back to the OS, because nothing else needs it.
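
For finer control, free heap memory can also be handed back to the OS explicitly with glibc’s malloc_trim via ctypes, along the lines of what the linked docs call manually trimming memory. A minimal sketch, Linux/glibc only:

import ctypes


def trim_memory() -> int:
    # malloc_trim(0) asks glibc to release as much free heap memory as it
    # can back to the OS; it returns 1 if something was released, else 0.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


# e.g. call it after a large computation finishes:
# z.sum().compute(scheduler='threads')
# trim_memory()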

Just thought I’d mention that I’ve been observing apparent memory leaks when using dask array on a Kubernetes cluster: several GB were accumulating in the main process where the client and scheduler run, despite no large results being computed into memory. Memory usage remained after closing the client and cluster. Clearing the scheduler logs freed a little memory but did not resolve the main leak.

I found that setting MALLOC_MMAP_THRESHOLD_=16384 substantially improves this, i.e., now I get ~200MB where I was getting ~2GB leaking.
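
If setting the environment variable before the process starts is awkward, the same knob can in principle be adjusted at runtime through glibc’s mallopt. A sketch assuming glibc, where M_MMAP_THRESHOLD is defined as -3 in malloc.h; it needs to run before the large allocations happen:

import ctypes

libc = ctypes.CDLL("libc.so.6")
M_MMAP_THRESHOLD = -3  # value from glibc's <malloc.h>

# Allocations at or above this size are served by mmap and returned to the
# OS when freed; mallopt returns a nonzero value on success.
if not libc.mallopt(M_MMAP_THRESHOLD, 16384):
    print("mallopt(M_MMAP_THRESHOLD, ...) failed")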

@njsmith

Replacing the allocator at runtime is indeed non-trivial, but this (now outdated) PR did just that: https://github.com/numpy/numpy/pull/5470

@tzickel

Can you try MALLOC_MMAP_THRESHOLD_=16384 and report the results? (note: the trailing underscore isn’t a typo!)

@gjoseph92 this was a very good explanation. Thanks for following up. This makes sense and we will experiment further.

This is long, apologies! Even single-threaded, a similar problem can be forced (CentOS 7, glibc):

import dask.array as da
import psutil
import ctypes
from distributed.utils import format_bytes
from dask.diagnostics import ProgressBar


class MallInfo(ctypes.Structure):
    _fields_ = [(name, ctypes.c_int)
                for name in ('arena', 'ordblks', 'smblks', 'hblks', 'hblkhd',
                             'usmblks', 'fsmblks', 'uordblks', 'fordblks',
                             'keepcost')]


libc = ctypes.CDLL("libc.so.6")
mallinfo = libc.mallinfo
mallinfo.argtypes = []
mallinfo.restype = MallInfo

# Field descriptions from the example on the mallinfo man page, e.g.
# https://linux.die.net/man/3/mallinfo
# Note: format_bytes is applied to every field for brevity, so the count
# fields (ordblks, smblks, hblks) print with a spurious "B" suffix below.
def pprint(mi):
    print("Total non-mmapped bytes (arena):       %s" % format_bytes(mi.arena))
    print("# of free chunks (ordblks):            %s" % format_bytes(mi.ordblks))
    print("# of free fastbin blocks (smblks):     %s" % format_bytes(mi.smblks))
    print("# of mapped regions (hblks):           %s" % format_bytes(mi.hblks))
    print("Bytes in mapped regions (hblkhd):      %s" % format_bytes(mi.hblkhd))
    print("Max. total allocated space (usmblks):  %s" % format_bytes(mi.usmblks))
    print("Free bytes held in fastbins (fsmblks): %s" % format_bytes(mi.fsmblks))
    print("Total allocated space (uordblks):      %s" % format_bytes(mi.uordblks))
    print("Total free space (fordblks):           %s" % format_bytes(mi.fordblks))
    print("Topmost releasable block (keepcost):   %s" % format_bytes(mi.keepcost))

def show_mallinfo():
    print("Malloc info:")
    pprint(mallinfo())
    print("")


proc = psutil.Process()

print(format_bytes(proc.memory_info().rss))
print(proc.memory_full_info())
show_mallinfo()

# initial allocation should be 3 chunks of 0x1c71c70 bytes each
x = da.ones((0x111111, 0xA), chunks=(0x5B05B, 0xA))
z = x.rechunk((0xf1, 0x2))  # chunks of 0xf10 bytes when materialised

print(format_bytes(proc.memory_info().rss))
print(proc.memory_full_info())
show_mallinfo()

ProgressBar().register()

z.sum().compute(scheduler='single-threaded')

print(format_bytes(proc.memory_info().rss))
print(proc.memory_full_info())
show_mallinfo()

this gives me:

50.23 MB
pfullmem(rss=50229248, vms=434524160, shared=14315520, text=2879488, lib=0, data=32526336, dirty=0, uss=36958208, pss=43038720, swap=0)
Malloc info:
Total non-mmapped bytes (arena):       9.92 MB
# of free chunks (ordblks):            269 B
# of free fastbin blocks (smblks):     0 B
# of mapped regions (hblks):           1 B
Bytes in mapped regions (hblkhd):      1.31 MB
Max. total allocated space (usmblks):  0 B
Free bytes held in fastbins (fsmblks): 0 B
Total allocated space (uordblks):      9.34 MB
Total free space (fordblks):           585.50 kB
Topmost releasable block (keepcost):   546.45 kB

63.62 MB
pfullmem(rss=63619072, vms=447827968, shared=14376960, text=2879488, lib=0, data=45830144, dirty=0, uss=50257920, pss=56367104, swap=0)
Malloc info:
Total non-mmapped bytes (arena):       11.61 MB
# of free chunks (ordblks):            276 B
# of free fastbin blocks (smblks):     0 B
# of mapped regions (hblks):           2 B
Bytes in mapped regions (hblkhd):      3.94 MB
Max. total allocated space (usmblks):  0 B
Free bytes held in fastbins (fsmblks): 0 B
Total allocated space (uordblks):      9.38 MB
Total free space (fordblks):           2.23 MB
Topmost releasable block (keepcost):   2.09 MB

[########################################] | 100% Completed | 26.1s
156.76 MB
pfullmem(rss=156762112, vms=1145044992, shared=14606336, text=2879488, lib=0, data=743047168, dirty=0, uss=143835136, pss=149986304, swap=0)
Malloc info:
Total non-mmapped bytes (arena):       45.87 MB
# of free chunks (ordblks):            366 B
# of free fastbin blocks (smblks):     112 B
# of mapped regions (hblks):           2 B
Bytes in mapped regions (hblkhd):      3.94 MB
Max. total allocated space (usmblks):  0 B
Free bytes held in fastbins (fsmblks): 6.37 kB
Total allocated space (uordblks):      28.54 MB
Total free space (fordblks):           17.33 MB
Topmost releasable block (keepcost):   3.54 MB

LD_PRELOADing jemalloc and rerunning made no difference to the outcome (as expected, mallinfo reports nothing useful here since allocations are redirected away from glibc malloc):

56.31 MB
pfullmem(rss=56311808, vms=444801024, shared=14376960, text=2879488, lib=0, data=42602496, dirty=0, uss=43114496, pss=49190912, swap=0)
Malloc info:
Total non-mmapped bytes (arena):       0 B
# of free chunks (ordblks):            1 B
# of free fastbin blocks (smblks):     0 B
# of mapped regions (hblks):           0 B
Bytes in mapped regions (hblkhd):      0 B
Max. total allocated space (usmblks):  0 B
Free bytes held in fastbins (fsmblks): 0 B
Total allocated space (uordblks):      0 B
Total free space (fordblks):           0 B
Topmost releasable block (keepcost):   0 B

67.61 MB
pfullmem(rss=67612672, vms=457646080, shared=14450688, text=2879488, lib=0, data=55447552, dirty=0, uss=54280192, pss=60385280, swap=0)
Malloc info:
Total non-mmapped bytes (arena):       0 B
# of free chunks (ordblks):            1 B
# of free fastbin blocks (smblks):     0 B
# of mapped regions (hblks):           0 B
Bytes in mapped regions (hblkhd):      0 B
Max. total allocated space (usmblks):  0 B
Free bytes held in fastbins (fsmblks): 0 B
Total allocated space (uordblks):      0 B
Total free space (fordblks):           0 B
Topmost releasable block (keepcost):   0 B

[########################################] | 100% Completed | 17.7s
157.43 MB
pfullmem(rss=157429760, vms=760979456, shared=14684160, text=2879488, lib=0, data=358780928, dirty=0, uss=144236544, pss=150386688, swap=0)
Malloc info:
Total non-mmapped bytes (arena):       0 B
# of free chunks (ordblks):            1 B
# of free fastbin blocks (smblks):     0 B
# of mapped regions (hblks):           0 B
Bytes in mapped regions (hblkhd):      0 B
Max. total allocated space (usmblks):  0 B
Free bytes held in fastbins (fsmblks): 0 B
Total allocated space (uordblks):      0 B
Total free space (fordblks):           0 B
Topmost releasable block (keepcost):   0 B
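
As an aside, a quick way to confirm that the LD_PRELOAD actually took effect in a run like the one above is to look at the process’s own memory maps (Linux only):

# Check /proc/self/maps for a preloaded allocator library.
with open("/proc/self/maps") as maps:
    preloaded = sorted({line.split()[-1] for line in maps if "jemalloc" in line})
print(preloaded or "jemalloc is not mapped into this process")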

Switching back to glibc memory allocation: adding a shim library to trace the mallocs/frees with mtrace(3) and then scraping the log shows:

@ <path>/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa28e38e040 0x1c71c70
@ python:[0x55932d794e50] - 0x7fa2a8001660
@ python:(_PyObject_GC_Resize+0x3d)[0x55932d73d73d] < 0x7fa2a80011d0
@ python:(_PyObject_GC_Resize+0x3d)[0x55932d73d73d] > 0x7fa2a8001690 0x280
@ <path>/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa28c71c040 0x1c71c70
@ python:[0x55932d794e50] - 0x7fa2ac001660
@ python:(_PyObject_GC_Resize+0x3d)[0x55932d73d73d] < 0x7fa2ac0011d0
@ python:(_PyObject_GC_Resize+0x3d)[0x55932d73d73d] > 0x7fa2ac001690 0x280
@ <path>/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa28aaaa040 0x1c71c70

These are the first 3 large chunks from the assignment to x. Similar entries appear later for the smaller chunks when materialisation occurs; these are of size 0xf10. A sample:

@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284cbf] + 0x7fa2b0001bc0 0x20
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa2b0001c40 0x8
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837a53f] + 0x7fa2b0001c80 0xf10
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837e8b6] - 0x7fa2b0001c80
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2b0001c10 0x20
@ python:[0x55932d794e50] - 0x7fa2980015f0
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2980015f0 0x20
@ python:[0x55932d794e50] - 0x7fa29c001570
@ python:[0x55932d794e50] - 0x55932eb7fc90
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa29c001570 0x20
@ python:[0x55932d794e50] - 0x7fa2a80011d0
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284cbf] + 0x7fa2a8001c00 0x20
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa2a8001cc0 0x8
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837a53f] + 0x7fa2a8001d00 0xf10
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837e8b6] - 0x7fa2a8001d00
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2a8001660 0x20
@ python:[0x55932d794e50] - 0x7fa2a00011d0
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2a00011d0 0x20
@ python:[0x55932d794e50] - 0x7fa2ac001660
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2ac001660 0x20
@ python:[0x55932d794e50] - 0x55932e6dfd10
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x55932e6dfd10 0x20
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x55932eb7fc90 0x20
@ python:[0x55932d794e50] - 0x7fa29c001570
@ python:[0x55932d794e50] - 0x55932e6825e0
@ python:[0x55932d794e50] - 0x55932eb71b50
@ python:[0x55932d794e50] - 0x55932ecbb4b0
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa29c001570 0x20
@ python:[0x55932d794e50] - 0x7fa2980015f0
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2980015f0 0x20
@ python:[0x55932d794e50] - 0x55932eb7fc90
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x55932eb7fc90 0x20
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x55932ecbb4b0 0x20
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x55932eb71b50 0x20
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x55932e6825e0 0x20
@ python:[0x55932d794e50] - 0x7fa2b0001c10
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2b0001c10 0x20
@ python:[0x55932d794e50] - 0x7fa2980015f0
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2980015f0 0x20
@ python:[0x55932d794e50] - 0x7fa29c001570
@ python:[0x55932d794e50] - 0x55932e6dfd10
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa29c001570 0x20
@ python:[0x55932d794e50] - 0x7fa2a8001660
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284cbf] + 0x7fa2a8001d00 0x20
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa2a8001c80 0x8
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837a53f] + 0x7fa2a8001d80 0xf10
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837e8b6] - 0x7fa2a8001d80
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284cbf] + 0x7fa2a8001d80 0x20
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c8284e35] + 0x7fa2a8001e00 0x8
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837a53f] + 0x7fa2a8001e40 0xf10
@ <path>site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so:[0x7fa2c837e8b6] - 0x7fa2a8001e40
@ python:(PyThread_allocate_lock+0x16)[0x55932d713456] + 0x7fa2a80012d0 0x20

The allocations/frees of size 0xf10 are typically interspersed with other 32- and 8-byte allocations. It should be noted that running the mtrace Perl script on the log shows no leak from NumPy (but a bit from Python).
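
For anyone who wants to reproduce this kind of trace without writing a shim library: on older glibc (CentOS 7 era) the tracing described in mtrace(3) can also be switched on from inside the process via ctypes. A sketch under that assumption; newer glibc (2.34+) moves the machinery into libc_malloc_debug.so, which then has to be preloaded:

import ctypes
import os

# MALLOC_TRACE must point at a writable file (any path will do) before
# mtrace() is called; assigning to os.environ also updates the C-level
# environment via putenv, so the C library sees it.
os.environ["MALLOC_TRACE"] = "/tmp/malloc_trace.log"

libc = ctypes.CDLL("libc.so.6")
libc.mtrace()    # start logging malloc/realloc/free calls to the file
# ... run the workload under test ...
libc.muntrace()  # stop logging; post-process the file with the mtrace script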

I’m pondering on the following things:

  • It is important to differentiate between a “leak” (memory that has been allocated but never freed) and a large RSS, which is essentially the total of the process’s pages resident in RAM.
  • Ignoring the question of crossing the mmap/brk threshold for now: to achieve an artificially high RSS one could simply interleave large allocations with one-byte allocations, then free the large ones. RSS will be large, but the amount actually in use by live allocations will be small (see the sketch after this list).
  • Some low-level allocation libraries claim to be (and are perhaps provably) better than others at handling fragmentation. To make use of these, I think everything that uses the stdlib.h family of allocation functions would need to be compiled against them, else there may be duplicate-symbol problems? The allocation library could of course be statically linked, but that would potentially hit the noted problems with mismatched malloc/free pairs.
  • The above traces suggest fragmentation. This is probably somewhat expected: the interpreter has a lot of heap-allocation churn, NumPy itself has some small heap-allocation churn, and finally there are large data blocks being heap-allocated and then quickly freed, all mixed in!
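
To make the second point above concrete, here is a small illustration (not from the original thread; glibc assumptions are noted in the comments): interleave sub-mmap-threshold NumPy buffers with small live allocations, then free the large ones. RSS typically stays near the peak even though almost nothing is still allocated.

import numpy as np
import psutil

proc = psutil.Process()


def rss():
    return "%d MB" % (proc.memory_info().rss // 2**20)


print("start:", rss())

big, keep = [], []
for _ in range(2000):
    # ~96 KB per buffer: below glibc's default 128 KB mmap threshold,
    # so these come from the heap rather than from individual mmaps.
    big.append(np.ones(12_000))
    # >512 bytes, so this bypasses pymalloc and lands on the same heap.
    keep.append(bytes(600))

print("peak: ", rss())

del big  # free every large buffer...
print("freed:", rss())
# ...but each freed chunk now sits between small live chunks, so the heap
# top cannot shrink and RSS typically stays near the peak (glibc defaults).

Exact numbers vary by allocator; glibc’s automatic trimming only releases free space at the top of the heap, which is precisely what MALLOC_TRIM_THRESHOLD_ tunes.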

Memory allocation is not an exact science. The heuristics used by each allocator may backfire in some use cases. When I mentioned jemalloc, I did not want to imply that it would be a silver bullet 😃 Just that it could be interesting to run experiments with it.

Yes, this seems nicer.

mrocklin@carbon:~/workspace/play$ python memory-sharding.py 
52.22 MB
[########################################] | 100% Completed |  6.5s
[########################################] | 100% Completed |  0.1s
697.00 MB
mrocklin@carbon:~/workspace/play$ export MALLOC_MMAP_THRESHOLD_=16384
mrocklin@carbon:~/workspace/play$ python memory-sharding.py 
52.42 MB
[########################################] | 100% Completed |  7.3s
[########################################] | 100% Completed |  0.1s
56.50 MB

It does reliably seem to increase runtimes, though that would be acceptable.