numba: Cache causes Segmentation Faults when generated in parallel

MacBook Pro 2018, macOS Catalina 10.15, Anaconda (pkgs-main repository), Python 3.7.4, numba=0.46.0, llvmlite=0.30.0

I’m running 6 dask distributed workers on localhost, 1 thread per worker. I push 9000 tasks to dask. Each task invokes a pure-Python function that internally imports and runs one of 1600 functions, one per module, each decorated with @guvectorize(cache=True).

Before starting, I clean all of my __pycache__ directories. The decorated functions are not imported on the client, which means that the on-disk cache is generated for the first time on the dask workers. As multiple tasks run at the same time using the same decorated function, it is very likely that two python interpreters will import the same python module containing the decorated functions at the same time and, simultaneously, build and save the cache.

To clarify:

client.py:

import distributed

def f1(x):
    import worker1
    worker1.g(x)

[...]

def f1600(x):
    import worker1600
    worker1600.g(x)

def main():
    with distributed.LocalCluster(n_workers=6, threads_per_worker=1) as cluster:
        with distributed.Client(cluster) as client:
            tasks = [
                client.submit(f, x)
                for f in (f1, f2, f3, ... f1600)
                for x in range(6)
            ]
            client.gather(tasks)

if __name__ == '__main__':
    main()

worker1.py ~ worker1600.py:

import numba

@numba.guvectorize(..., cache=True)
def g(x):
    ...

The dask workers randomly fail with Segmentation Fault; the same is reproduced when I later try running the same functions one by one by hand. Only deleting the cache files solves the issue.
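For anyone who needs to reset to a clean state, here is a minimal sketch of the cleanup. It assumes the cache sits in the default location, i.e. as `.nbi` (index) and `.nbc` (data) files inside the `__pycache__` directories next to the sources; if `NUMBA_CACHE_DIR` is set, the files live elsewhere. The helper name `purge_numba_cache` is made up for illustration.

```python
from pathlib import Path

# Assumption: numba's on-disk cache uses its default location and file
# suffixes (.nbi index files, .nbc data files) inside __pycache__.
NUMBA_CACHE_SUFFIXES = {".nbi", ".nbc"}


def purge_numba_cache(root):
    """Delete numba cache files below `root`; return the removed paths."""
    removed = []
    # Materialise the directory list first so we never delete while walking.
    for pycache in list(Path(root).rglob("__pycache__")):
        if not pycache.is_dir():
            continue
        for path in list(pycache.iterdir()):
            if path.is_file() and path.suffix in NUMBA_CACHE_SUFFIXES:
                path.unlink()
                removed.append(path)
    return removed
```

Regular `.pyc` files are left alone, so only the numba artefacts get regenerated on the next import.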

If I instead build the cache serially first, and then start sending tasks to dask afterwards so that the workers don’t rebuild the cache but just read it from disk, the crashes disappear.
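Roughly, the serial warm-up looks like the sketch below. It is not the exact script I used; `warm_cache` is a hypothetical helper, and it assumes that importing a module is enough to compile and save the cache, which should hold when explicit signatures are passed to the decorator. If that assumption is wrong for your setup, also call each function once inside the loop.

```python
import importlib


def warm_cache(module_names):
    """Import each module once, one after another, in the current process.

    Because the imports run serially in a single interpreter, no two
    processes write the same cache files at the same time.
    """
    loaded = []
    for name in module_names:
        importlib.import_module(name)
        loaded.append(name)
    return loaded
```

With the layout from client.py, this would be called as `warm_cache(f"worker{i}" for i in range(1, 1601))` before the cluster is started.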

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 2
  • Comments: 28 (12 by maintainers)

Most upvoted comments

💥 zsh» ipython
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from numba_segfault._0011 import f

In [2]: f(1.0)
[7]    46916 segmentation fault (core dumped)  ipython
ipython  0.98s user 6.34s system 45% cpu 16.195 total

Indeed.

WHAM! Self-contained POC that falls over after less than 30 seconds 🥇

import importlib
import os
import shutil
from concurrent.futures import ProcessPoolExecutor
from textwrap import dedent


os.chdir(os.path.abspath(os.path.dirname(__file__)))


def f(i, x):
    mod = importlib.import_module(f"numba_segfault._{i:04d}")
    mod.f(x)


def main():
    try:
        shutil.rmtree("numba_segfault")
    except FileNotFoundError:
        pass

    os.mkdir("numba_segfault")
    with open("numba_segfault/__init__.py", "w"):
        pass
    for i in range(1600):
        with open(f"numba_segfault/_{i:04d}.py", "w") as fh:
            fh.write(
                dedent(
                    """
                    from numba import guvectorize, f8

                    @guvectorize([(f8, f8[:])], "()->()", nopython=True, cache=True)
                    def f(x, out):
                        out[0] = x * 2
                    """
                ).lstrip()
            )

    with ProcessPoolExecutor(8) as ex:
        for i in range(1600):
            futures = [ex.submit(f, i, x) for x in range(8)]
            for future in futures:
                future.result()
            print(i)

if __name__ == "__main__":
    main()

Output:

[...]
75
76
77
Traceback (most recent call last):
  File "numba_segfault_runner.py", line 47, in <module>
    main()
  File "numba_segfault_runner.py", line 43, in main
    future.result()
  File "lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

And after that, by hand:

>>> from numba_segfault._0077 import f
>>> f(2)
4.0
>>> from numba_segfault._0078 import f
>>> f(2)
Segmentation fault: 11
Segmentation fault: 11
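Checking modules by hand like this kills the interpreter on every bad one. A possible way to automate it is to run each check in a throwaway subprocess and look at the return code. This is only a sketch: `run_isolated` and `describe` are hypothetical helpers, and it relies on the POSIX convention that `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run `snippet` in a fresh interpreter and return its return code."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX a negative code means the child died from that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with {returncode}"
```

For example, `describe(run_isolated("from numba_segfault._0078 import f; f(2)"))` would report whether that module's cached function brings the child process down, without taking the calling session with it.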