distributed: Nanny error: Worker process was killed by unknown signal

distributed.nanny - WARNING - Worker process 13375 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13377 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13372 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13383 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13373 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13384 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 13380 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

This happens without fail when using read_parquet with fastparquet. Switching to pyarrow avoids some of the failures, but it still happens x% of the time (x depends on how you set up n_workers, n_clients, and memory_limit in the client, but I would say it is always greater than 25%).

My machine runs Fedora 27, and I was able to work around the problem by setting multiprocessing-method to spawn, thanks to help from @mrocklin.

(While debugging this with @mrocklin, we were never able to get more information about the root cause.)
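The workaround mentioned above can be sketched as follows. This assumes a recent dask/distributed release in which the setting lives under the config key distributed.worker.multiprocessing-method; older releases read it from a differently structured config file, so treat the key name as an assumption to check against your installed version.

```python
import dask

# Assumption: the setting is exposed under this dotted config key
# (true for recent dask/distributed releases; verify for older ones).
KEY = "distributed.worker.multiprocessing-method"

# Ask for worker processes to be started with "spawn" instead of forking.
# This must run before the Client / LocalCluster is created.
dask.config.set({KEY: "spawn"})

print(dask.config.get(KEY))
```

Create the Client only after this call. When running as a script, do so under an if __name__ == "__main__": guard, because spawn re-imports the main module in each child process.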

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Another (self contained, though less minimal) example:

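import numpy as np
import pandas as pd
import pyarrow  # not used directly; confirms pyarrow is installed for Parquet I/O

import dask.dataframe as dd
from dask.distributed import Client

# Build a 3,000,000 x 20 frame of random floats and cast one column to int.
df = pd.DataFrame(np.random.rand(3_000_000, 20))
df.columns = ['a_{}'.format(i) for i in range(20)]
df['a_1'] = (df['a_1'] * 10000).astype(int)
df.to_parquet('./test.p', compression='gzip')

# Start a local cluster, read the file back with dask, and persist it on the workers.
client = Client()
df2 = dd.read_parquet('./test.p')
df2 = client.persist(df2)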

So far so good. Then df2.mean().compute() results in the following traceback:

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53696, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40091)
distributed.nanny - WARNING - Worker process 40091 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53698, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40090)
distributed.nanny - WARNING - Worker process 40090 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53701, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40095)
distributed.nanny - WARNING - Worker process 40095 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53702, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40094)
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:53702, threads: 1>>
Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 355, in catch_zombie
    yield
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/ioloop.py", line 1229, in _run
    return self.callback()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/nanny.py", line 264, in memory_monitor
    memory = proc.memory_info().rss
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/__init__.py", line 1047, in memory_info
    return self._proc.memory_info()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 335, in wrapper
    return fun(self, *args, **kwargs)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 463, in memory_info
    rawtuple = self._get_pidtaskinfo()
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_common.py", line 340, in wrapper
    return fun(self)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 394, in _get_pidtaskinfo
    ret = cext.proc_pidtaskinfo_oneshot(self.pid)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/jamesmckeown/anaconda2/envs/py36/lib/python3.6/site-packages/psutil/_psosx.py", line 368, in catch_zombie
    raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=40094)
distributed.nanny - WARNING - Worker process 40094 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
----------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-8-3f4e05f049ae> in <module>
----> 1 df2.mean().compute()

~/anaconda2/envs/py36/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    154         dask.base.compute
    155         """
--> 156         (result,) = compute(self, traverse=False, **kwargs)
    157         return result
    158 

~/anaconda2/envs/py36/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    393     keys = [x.__dask_keys__() for x in collections]
    394     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 395     results = schedule(dsk, keys, **kwargs)
    396     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    397 

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs)
   2228             try:
   2229                 results = self.gather(packed, asynchronous=asynchronous,
-> 2230                                       direct=direct)
   2231             finally:
   2232                 for f in futures.values():

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1591             return self.sync(self._gather, futures, errors=errors,
   1592                              direct=direct, local_worker=local_worker,
-> 1593                              asynchronous=asynchronous)
   1594 
   1595     @gen.coroutine

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    645             return future
    646         else:
--> 647             return sync(self.loop, func, *args, **kwargs)
    648 
    649     def __repr__(self):

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

~/anaconda2/envs/py36/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

~/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1131 
   1132                     try:
-> 1133                         value = future.result()
   1134                     except Exception:
   1135                         self.had_exception = True

~/anaconda2/envs/py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1139                     if exc_info is not None:
   1140                         try:
-> 1141                             yielded = self.gen.throw(*exc_info)
   1142                         finally:
   1143                             # Break up a reference to itself

~/anaconda2/envs/py36/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1467                             six.reraise(type(exception),
   1468                                         exception,
-> 1469                                         traceback)
   1470                     if errors == 'skip':
   1471                         bad_keys.add(key)

~/anaconda2/envs/py36/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ("('dataframe-sum-chunk-dataframe-sum-agg-c3bb33a21a4ded5f08dea3ba88b780b0', 0)", 'tcp://127.0.0.1:53702')

Similar code snippets which execute as expected:

  1. Remove the line df['a_1']=(df['a_1']*10000).astype(int).
  2. Reduce np.random.rand(3_000_000,20) to np.random.rand(2_000_000,20).
  3. Skip the Parquet round trip and build the dask dataframe from pandas:

pdf = pd.DataFrame(np.random.rand(10_000_000, 20))
df = dd.from_pandas(pdf, chunksize=10000)
df2 = client.persist(df)
df2.mean().compute()
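
For a sense of scale, the in-memory size of the frames in these snippets can be estimated from their shapes. This is a back-of-the-envelope sketch, assuming every value takes 8 bytes (float64, and int64 for the cast column on most platforms) and ignoring index and Parquet decoding overhead; estimate_bytes is a hypothetical helper, not part of pandas or dask.

```python
def estimate_bytes(rows, cols, bytes_per_value=8):
    """Rough dense size of a rows x cols frame, assuming fixed-width values."""
    return rows * cols * bytes_per_value


# Shapes taken from the snippets in this comment.
shapes = {
    "failing Parquet example": (3_000_000, 20),
    "passing Parquet example": (2_000_000, 20),
    "passing from_pandas example": (10_000_000, 20),
}

for label, (rows, cols) in shapes.items():
    size = estimate_bytes(rows, cols)
    print("{}: {:.0f} MiB".format(label, size / 2**20))
```

The from_pandas case is the largest of the three yet runs cleanly, so raw data size alone does not appear to explain the worker restarts.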

This issue would benefit from a minimal reproducible example.