numba: Cache causes Segmentation Faults when generated in parallel
MacBook Pro 2018, macOS Catalina 10.15, Anaconda (pkgs-main repository), Python 3.7.4, numba=0.46.0, llvmlite=0.30.0
I’m running 6 dask distributed workers on localhost, 1 thread per worker.
I submit 9000 tasks to dask. Each task invokes a pure-Python function, which in turn imports and runs one of 1600 functions (one function per module) decorated with @guvectorize(cache=True).
Before starting, I clean all of my __pycache__ directories. The decorated functions are not imported on the client, which means that the on-disk cache is generated for the first time on the dask workers. As multiple tasks run at the same time using the same decorated function, it is very likely that two python interpreters will import the same python module containing the decorated functions at the same time and, simultaneously, build and save the cache.
To clarify:
client.py:
import distributed

def f1(x):
    import worker1
    worker1.g(x)

[...]

def f1600(x):
    import worker1600
    worker1600.g(x)

def main():
    with distributed.LocalCluster(n_workers=6, threads_per_worker=1) as cluster:
        with distributed.Client(cluster) as client:
            tasks = [
                client.submit(f, x)
                for f in (f1, f2, f3, ... f1600)
                for x in range(6)
            ]
            client.gather(tasks)

if __name__ == '__main__':
    main()
worker1.py ~ worker1600.py:
import numba

@numba.guvectorize(..., cache=True)
def g(x):
    ...
The dask workers randomly fail with a Segmentation Fault. The crash also reproduces afterwards when I run the same functions one by one by hand, outside dask. Only deleting the cache files solves the issue.
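For anyone hitting the same crash, here is a minimal sketch of the "delete the cache files" step. It assumes the cache lives in the default location, i.e. the `__pycache__` directories next to the modules, and that numba's cache consists of `*.nbi` index files and `*.nbc` data files. The helper name `clear_numba_cache` is made up for this example.

```python
from pathlib import Path


def clear_numba_cache(root):
    """Delete numba on-disk cache files below `root`.

    Assumption: the cache is in the default location (__pycache__)
    and uses the *.nbi (index) and *.nbc (data) file extensions.
    Regular *.pyc files are left alone.
    """
    removed = []
    for pycache in Path(root).rglob("__pycache__"):
        if not pycache.is_dir():
            continue
        for pattern in ("*.nbi", "*.nbc"):
            for path in pycache.glob(pattern):
                path.unlink()
                removed.append(path)
    return removed
```

Calling `clear_numba_cache(".")` from the project root returns the list of files it removed, which is handy for checking that a possibly corrupted cache was actually there.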
If I instead build the cache serially first, and then start sending tasks to dask afterwards so that the workers don’t rebuild the cache but just read it from disk, the crashes disappear.
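The serial workaround can be sketched roughly as below. It relies on the behaviour described above, namely that importing a module containing the decorated function is enough to build and save its cache. `warm_cache` is a hypothetical helper name, and the `worker1` ... `worker1600` module names are the ones from the example, not real packages.

```python
import importlib


def warm_cache(module_names):
    """Import each module once, serially, in the current process.

    Per the report, importing the module is what triggers the cache
    build, so doing it here (before any task is submitted) means the
    dask workers only ever read the cache from disk.
    """
    loaded = []
    for name in module_names:
        importlib.import_module(name)
        loaded.append(name)
    return loaded


# Hypothetical usage, before creating the LocalCluster:
# warm_cache(f"worker{i}" for i in range(1, 1601))
```

This trades startup time on the client for stability: only one process ever writes the cache, so there is nothing to race against.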
About this issue
- State: open
- Created 5 years ago
- Reactions: 2
- Comments: 28 (12 by maintainers)
Commits related to this issue
- Fix race condition in cache read/write. As title. Fixes #4807 — committed to stuartarchibald/numba by stuartarchibald 4 years ago
@crusaderky https://github.com/stuartarchibald/numba/commit/af64272017f1aeaba36c0c177b736c609ea2908b fixes this; I still need to work out how to test it.
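For context on this class of bug: a common generic way to stop readers from ever seeing a half-written cache file is to write to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that pattern only. It is not the content of the commit linked above, and `atomic_write_bytes` is a made-up name.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write `data` to `path` so that readers see either the old file
    or the complete new one, never a partial write.

    Illustration of the general technique, not numba's actual code.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern, two processes writing the same file at once each produce a complete file, and the last rename wins, which is harmless as long as both wrote equivalent content.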
Indeed.
WHAM! Self-contained POC that falls over after less than 30 seconds 🥇
Output:
And after that, by hand: