distributed: Timed out trying to connect to host after 10 s

I have a dask distributed cluster up and running with 40 workers:

dask_client = Client('localhost:8786')
dask_client.restart()
dask_client

I’ve restarted everything, so no tasks are queued, and the scheduler log shows:

distributed.scheduler - INFO - Clear task state

I have a large CSR sparse matrix that I am scattering to the cluster:

csr_future = dask_client.scatter(csr, broadcast=True)
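
For context, the end-to-end pattern is roughly the following (a minimal sketch, assuming scipy.sparse; the matrix size here is only illustrative, the real one is much larger):

import scipy.sparse as sp
from dask.distributed import Client

dask_client = Client('localhost:8786')   # scheduler address as above
dask_client.restart()

# Illustrative CSR matrix standing in for the real data
csr = sp.random(100_000, 50_000, density=1e-4, format='csr')

# broadcast=True asks the scheduler to replicate the data to every worker
csr_future = dask_client.scatter(csr, broadcast=True)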

After a few seconds, I see:

distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.core - INFO - Removing comms to tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:38051
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:46414
distributed.core - INFO - Starting established connection

So it looks like some workers are being removed and new workers are being added back to replace them. Around 30 seconds after this, I see multiple tornado errors:

tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
    quiet_exceptions=EnvironmentError,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
    comm = yield self.pool.connect(self.addr)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
    connection_args=self.connection_args,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e39007c88>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
    quiet_exceptions=EnvironmentError,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
    result = yield result
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2496, in scatter
    yield self.replicate(keys=keys, workers=workers, n=n)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2903, in replicate
    for w, who_has in gathers.items()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
    comm = yield self.pool.connect(self.addr)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
    connection_args=self.connection_args,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused

The timeout/connection-refused errors refer to the same IP addresses and ports that the scheduler was removing comms from earlier. I can’t seem to resolve this.

In case it matters, I am running these commands from jupyterlab=0.35.5 running alongside the dask-scheduler, with tornado=6.0.2 and dask=1.2.2.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 30 (9 by maintainers)

Most upvoted comments

This seems to fix it for me locally:

diff --git a/distributed/worker.py b/distributed/worker.py
index e3ef6b26..e5666aca 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -882,6 +882,12 @@ class Worker(ServerNode):
                 )
                 self.bandwidth_workers.clear()
                 self.bandwidth_types.clear()
+            except IOError as e:
+                # Scheduler is gone. Respect distributed.comm.timeouts.connect
+                if "Timed out trying to connect" in str(e):
+                    await self.close(report=False)
+                else:
+                    raise e
             except CommClosedError:
                 logger.warning("Heartbeat to scheduler failed")
             finally:

I start a scheduler with dask-scheduler, connect a worker, then kill the scheduler. I’m going to write up a test now and make a PR.

cc @scottyhq

I am using a local cluster and got the same error; commenting this change out worked for me. As an alternative, could you advise how to change the “distributed.comm.timeouts.connect” setting?
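
For reference, the connect timeout is ordinary Dask configuration, so it can typically be raised either in code or through the environment; a minimal sketch, assuming the setting is applied before the Client is created:

import dask

# Raise the connect timeout from the default 10 s to 60 s.
# Equivalent to setting distributed.comm.timeouts.connect in a dask config
# YAML file, or exporting DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s.
dask.config.set({'distributed.comm.timeouts.connect': '60s'})

from dask.distributed import Client
dask_client = Client('localhost:8786')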

Hmmm, I’m not sure what would be causing that. From the scheduler logs it looks like the workers may be dying and restarting, and scatter may not be equipped to handle that properly. Can you look at the worker logs to see if there’s any revealing output? Workers may die if they use too much memory, or if deserialization leads to irrecoverable errors (e.g. segfaults).
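
One way to pull those logs without shelling into each node is through the client itself; a minimal sketch, assuming Client.get_worker_logs is available in this version of distributed and using the dask_client from the original report:

# Fetch recent log records from every worker via the scheduler
logs = dask_client.get_worker_logs()
for worker_addr, entries in logs.items():
    print(worker_addr)
    for level, message in list(entries)[-20:]:   # last 20 records per worker
        print('   ', level, message)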