distributed: Timed out trying to connect to host after 10 s
I have a dask distributed cluster up and running on 40 workers:
dask_client = Client('localhost:8786')
dask_client.restart()
dask_client
I’ve restarted everything so no tasks are queued and the scheduler log shows:
distributed.scheduler - INFO - Clear task state
I have a large csr sparse matrix that I am scattering to the cluster:
csr_future = dask_client.scatter(csr, broadcast=True)
After a few seconds, I see:
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.core - INFO - Removing comms to tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:38051
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:46414
distributed.core - INFO - Starting established connection
So, it looks like some workers are being removed and new workers are being added back to replace those workers. Around 30 seconds after this, I see multiple tornado errors:
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
result_list.append(f.result())
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
connection_args=self.connection_args,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e39007c88>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
result = yield result
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2496, in scatter
yield self.replicate(keys=keys, workers=workers, n=n)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2903, in replicate
for w, who_has in gathers.items()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
result_list.append(f.result())
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
connection_args=self.connection_args,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
It looks like the time out/connection refused are referring to the same ipaddress/ports where it was trying to Removing comms from earlier up above. I can’t seem to resolve this.
In case it matters, I am running these commands in a jupyterlab=0.35.5 that is running next to the dask-scheduler and we are running tornado=6.0.2 with dask=1.2.2.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 4
- Comments: 30 (9 by maintainers)
Commits related to this issue
- Fix hanging worker when the scheduler leaves Closes https://github.com/dask/distributed/issues/2880 — committed to TomAugspurger/distributed by TomAugspurger 5 years ago
- Fix hanging worker when the scheduler leaves (#3250) Closes https://github.com/dask/distributed/issues/2880 — committed to dask/distributed by TomAugspurger 5 years ago
Seems like this fixes it for me locally
I start a scheduler with
dask-scheduler, connect a worker, then kill the scheduler. Going to write up a test now and make a PRcc @scottyhq
I am using a local cluster and got the same error. Commenting this change out worked for me. As an alternative could you advise on how to change the “distributed.comm.timeouts.connect” setting?
Hmmm, I’m not sure what would be causing that. From the scheduler logs it looks like the workers may be dieing and restarting, and scatter may not be equipped to handle that properly. Can you look at the worker logs to see if there’s any revealing output? Workers may die if they use too much memory, or if deserialization leads to irrecoverable errors (e.g. segfaults).