dask-cuda: Program crashes when running same program twice sequentially
When running the below piece of code:
import cupy as cp
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import rmm
if __name__ == '__main__':
    cluster = LocalCUDACluster('0', rmm_managed_memory=True)
    client = Client(cluster)
    # Set the RMM/CuPy memory allocator on every Dask worker.
    client.run(cp.cuda.set_allocator, rmm.rmm_cupy_allocator)

    # Here we set RMM/CuPy memory allocator on the "current" process,
    # i.e., the Dask client.
    rmm.reinitialize(managed_memory=True)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

    shape = (512, 512, 30000)
    chunks = (100, 100, 1000)
    huge_array_gpu = da.ones_like(cp.array(()), shape=shape, chunks=chunks)
    array_sum = da.multiply(huge_array_gpu, 17).persist()

    # `persist()` returns immediately while the computation runs in the
    # background, so we must `wait()` for the actual compute to finish.
    wait(array_sum)
It runs perfectly the first time. That run creates a folder called dask-worker-space/storage. If I delete this folder, I can run the program again with no problem. If I do not, however, the second run fails with the following error:
2022-10-18 10:10:16,404 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-10-18 10:10:16,404 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-10-18 10:10:16,626 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2022-10-18 10:10:16,676 - distributed.nanny - ERROR - Failed while trying to start worker process: Worker failed to start.
2022-10-18 10:10:16,677 - distributed.nanny - ERROR - Failed to connect to process
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2022-10-18 10:10:16,678 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 423, in instantiate
result = await self.process.start()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-22' coro=<_wrap_awaitable() done, defined at /home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py:681> exception=RuntimeError('Nanny failed to start.')>
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 350, in start_unsafe
response = await self.instantiate()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 423, in instantiate
result = await self.process.start()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
2022-10-18 10:10:16,683 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fcf80946fd0>>, <Task finished name='Task-21' coro=<SpecCluster._correct_state_internal() done, defined at /home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/deploy/spec.py:319> exception=RuntimeError('Worker failed to start.')>)
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/deploy/spec.py", line 358, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 469, in start
raise self.__startup_exc
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 350, in start_unsafe
response = await self.instantiate()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 423, in instantiate
result = await self.process.start()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2022-10-18 10:10:17,673 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-10-18 10:10:17,673 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-10-18 10:10:17,871 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2022-10-18 10:10:17,915 - distributed.nanny - ERROR - Failed while trying to start worker process: Worker failed to start.
2022-10-18 10:10:17,915 - distributed.nanny - ERROR - Failed to connect to process
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2022-10-18 10:10:17,916 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 423, in instantiate
result = await self.process.start()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-36' coro=<_wrap_awaitable() done, defined at /home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py:681> exception=RuntimeError('Nanny failed to start.')>
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 350, in start_unsafe
response = await self.instantiate()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 423, in instantiate
result = await self.process.start()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
Traceback (most recent call last):
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self._register_with_scheduler()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in _register_with_scheduler
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/worker.py", line 1093, in <dictcomp>
types={k: typename(v) for k, v in self.data.items()},
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/_collections_abc.py", line 850, in __iter__
for key in self._mapping:
RuntimeError: Set changed size during iteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/joachim/Desktop/src/pygpubatch/regex.py", line 10, in <module>
cluster = LocalCUDACluster('0', rmm_managed_memory=True)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/dask_cuda/local_cuda_cluster.py", line 366, in __init__
self.sync(self._correct_state)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
return sync(
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/utils.py", line 405, in sync
raise exc.with_traceback(tb)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/utils.py", line 378, in f
result = yield future
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/deploy/spec.py", line 358, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 469, in start
raise self.__startup_exc
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 480, in start
await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 350, in start_unsafe
response = await self.instantiate()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 423, in instantiate
result = await self.process.start()
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 669, in start
msg = await self._wait_until_connected(uid)
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 789, in _wait_until_connected
raise msg["exception"]
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/nanny.py", line 858, in run
await worker
File "/home/joachim/anaconda3/envs/rps/lib/python3.9/site-packages/distributed/core.py", line 488, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
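As described above, deleting dask-worker-space/storage between runs avoids the crash. A minimal sketch of automating that cleanup follows. It assumes the folder sits in the current working directory, as in the report; the helper name `remove_stale_spill_dir` is hypothetical and not part of dask-cuda or distributed.

```python
import shutil
from pathlib import Path


def remove_stale_spill_dir(base_dir=".", relative="dask-worker-space/storage"):
    """Delete a leftover storage folder from a previous run, if present.

    Hypothetical helper: returns True when a folder was removed and
    False when there was nothing to clean up.
    """
    stale = Path(base_dir) / relative
    if stale.is_dir():
        # Remove the folder and everything inside it.
        shutil.rmtree(stale)
        return True
    return False
```

Calling `remove_stale_spill_dir()` at the top of the `if __name__ == '__main__':` block, before `LocalCUDACluster(...)` is constructed, would reproduce the manual workaround. It only works around the symptom and should not be run while another cluster is still using the same folder.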
About this issue
- State: closed
- Created 2 years ago
- Comments: 22 (10 by maintainers)
Commits related to this issue
- Make local_directory a required argument for spilling impls For automated cleanup when the cluster exits, the on-disk spilling directory needs to live inside the relevant worker's local_directory. Si... — committed to wence-/dask-cuda by wence- 2 years ago
- Make local_directory a required argument for spilling impls (#1023) For automated cleanup when the cluster exits, the on-disk spilling directory needs to live inside the relevant worker's local_direc... — committed to rapidsai/dask-cuda by wence- 2 years ago
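The reasoning in the commit message (a spill directory that lives inside the worker's local_directory is cleaned up together with it) can be illustrated with a plain-Python model. This is only a sketch of the idea, not dask-cuda's code: `tempfile.TemporaryDirectory` stands in for a worker's local_directory, which is assumed to be removed on shutdown, and `make_spill_dir` and `run_once` are hypothetical names.

```python
import tempfile
from pathlib import Path


def make_spill_dir(local_directory):
    """Create a storage folder nested inside the given local directory."""
    spill = Path(local_directory) / "storage"
    spill.mkdir(parents=True, exist_ok=True)
    return spill


def run_once():
    # The temporary directory models a worker's local_directory; it is
    # deleted, with all of its contents, when the `with` block exits.
    with tempfile.TemporaryDirectory(prefix="dask-worker-space-") as local_directory:
        spill = make_spill_dir(local_directory)
        (spill / "chunk-0").write_bytes(b"spilled data")
    # The nested spill folder was removed along with its parent.
    return spill


leftover = run_once()
print(leftover.exists())  # False: nothing is left behind for the next run
```

Because the spill folder is a child of the directory being cleaned up, no separate cleanup step is needed and a second run starts from an empty state.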
Working on enabling this via https://github.com/dask/distributed/issues/7151