ray: [core] occasional Ray port conflict issue

What is the problem?

We encountered an occasional port conflict in Ray 1.3.0: a Ray component tries to use a port number that is already used by another component.

Reproduction (REQUIRED)

Start the head node as normal, then start the worker node in a cloud environment with the command below:

ray start --address=agent10909-phx4.prod.uber.internal:31014 --object-manager-port=31009 --worker-port-list=31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066 --num-cpus=10 --num-gpus=1 --block

We estimate that this issue occurs in about 1 out of 100 runs.

The worker node fails to start, and the log looks like the one below. Ray itself picks the same port for dashboard_agent and metrics_export, neither of which we specified in our ray start command.

2021-08-22 07:50:43,072 INFO : worker_ports_str is 31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066
2021-08-22 07:50:43,073 INFO : Running ray worker with ray start --address=agent10909-phx4.prod.uber.internal:31014 --object-manager-port=31009 --worker-port-list=31034,31035,31046,31047,31048,31049,31061,31062,31063,31064,31065,31066 --num-cpus=10 --num-gpus=1 --block
/usr/lib/python3.6/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
"update your install command.", FutureWarning)
Traceback (most recent call last):
  File "/usr/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.6/site-packages/ray/scripts/scripts.py", line 1706, in main
    return cli()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ray/scripts/scripts.py", line 657, in start
    ray_params, head=False, shutdown_at_exit=block, spawn_reaper=block)
  File "/usr/lib/python3.6/site-packages/ray/node.py", line 223, in __init__
    self._ray_params.update_pre_selected_port()
  File "/usr/lib/python3.6/site-packages/ray/_private/parameter.py", line 297, in update_pre_selected_port

ValueError: Ray component metrics_export is trying to use a port number 61240 that is used by other components.

Port information: {'gcs': [], 'object_manager': [31009], 'node_manager': [], 'gcs_server': [], 'client_server': [10001], 'dashboard': [8265], 'dashboard_agent': [61240], 'metrics_export': [61240], 'redis_shards': [], 'worker_ports': [31034, 31035, 31046, 31047, 31048, 31049, 31061, 31062, 31063, 31064, 31065, 31066]}
If you allocate ports, please make sure the same port is not used by multiple components.
I0822 07:50:44.000429     9 executor.cpp:1015] Command exited with status 0 (pid: 73)
I0822 07:50:45.002389    72 process.cpp:927] Stopped the socket accept loop
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
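
To make the failure in the log easier to read, here is a minimal sketch that scans the "Port information" dictionary from the error message and reports every port claimed by more than one component. The helper name `find_shared_ports` is hypothetical and is not part of Ray; it only illustrates the kind of check that raised the ValueError.

```python
from collections import defaultdict

# Port information copied verbatim from the error message in the log.
port_info = {
    "gcs": [],
    "object_manager": [31009],
    "node_manager": [],
    "gcs_server": [],
    "client_server": [10001],
    "dashboard": [8265],
    "dashboard_agent": [61240],
    "metrics_export": [61240],
    "redis_shards": [],
    "worker_ports": [31034, 31035, 31046, 31047, 31048, 31049,
                     31061, 31062, 31063, 31064, 31065, 31066],
}


def find_shared_ports(port_info):
    """Return {port: [components]} for ports claimed by 2+ components.

    Hypothetical helper for illustration; not Ray's own code.
    """
    owners = defaultdict(list)
    for component, ports in port_info.items():
        for port in ports:
            owners[port].append(component)
    return {port: names for port, names in owners.items() if len(names) > 1}


conflicts = find_shared_ports(port_info)
print(conflicts)  # {61240: ['dashboard_agent', 'metrics_export']}
```

Only the two ports we did not pass on the command line collide; none of the ports we set explicitly are involved.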

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 29 (20 by maintainers)

Most upvoted comments

I strongly recommend setting every port manually when you deploy Ray, to avoid port conflicts: https://docs.ray.io/en/master/ray-core/configure.html#ports-configurations
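
As a rough sketch of what "set every port manually" could look like for the worker command in this issue, the snippet below builds the `ray start` argument list from an explicit port table and refuses to continue if any port appears twice. The helper `build_ray_start_command` and the port values 31010 and 31011 are assumptions made for illustration. `--node-manager-port` and `--metrics-export-port` are `ray start` options, but the set of available port flags differs between Ray versions (the dashboard agent port is left out here for that reason), so check `ray start --help` on your install before relying on this.

```python
import shlex


def build_ray_start_command(address, ports, worker_ports, extra_args=()):
    """Build a `ray start` argv with explicit ports, rejecting duplicates.

    `ports` maps a flag name (without the leading dashes) to a port number.
    Hypothetical helper; it only assembles strings and never calls Ray.
    """
    all_ports = list(ports.values()) + list(worker_ports)
    if len(all_ports) != len(set(all_ports)):
        raise ValueError(f"duplicate port in {sorted(all_ports)}")
    cmd = ["ray", "start", f"--address={address}"]
    for flag, port in ports.items():
        cmd.append(f"--{flag}={port}")
    cmd.append("--worker-port-list=" + ",".join(str(p) for p in worker_ports))
    cmd.extend(extra_args)
    return cmd


cmd = build_ray_start_command(
    address="agent10909-phx4.prod.uber.internal:31014",
    ports={
        "object-manager-port": 31009,   # from the original command
        "node-manager-port": 31010,     # assumed value, pick your own
        "metrics-export-port": 31011,   # assumed value, pick your own
    },
    worker_ports=[31034, 31035, 31046, 31047, 31048, 31049,
                  31061, 31062, 31063, 31064, 31065, 31066],
    extra_args=["--num-cpus=10", "--num-gpus=1", "--block"],
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The duplicate check runs before anything is launched, so a bad port table fails fast instead of surfacing as a ValueError inside Ray.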

It’s good that the issue can be worked around by coding up some port allocation logic, or retrying. It’s pretty bad that the basic Ray start API has a random chance of failure, for known reasons.
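
For anyone who wants the "port allocation logic" workaround, one possible sketch is to ask the operating system for free ports by binding to port 0, and to hold all the sockets open until every port has been chosen so that no two of them can be the same. The function name `allocate_distinct_ports` is hypothetical. Note the usual caveat: once the sockets are closed, another process could grab one of the ports before Ray binds it, so this narrows the race rather than removing it.

```python
import socket
from contextlib import ExitStack


def allocate_distinct_ports(count, host="127.0.0.1"):
    """Return `count` distinct TCP ports that were free at call time.

    All sockets stay bound until every port is picked, which is what
    guarantees the returned ports differ from each other.
    """
    with ExitStack() as stack:
        ports = []
        for _ in range(count):
            sock = stack.enter_context(
                socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            )
            sock.bind((host, 0))  # port 0 lets the OS choose a free port
            ports.append(sock.getsockname()[1])
        return ports  # sockets are closed when the ExitStack exits


ports = allocate_distinct_ports(3)
print(ports)
```

The resulting numbers could then be passed to the port options of `ray start` instead of leaving them to be chosen at random.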

This wasn’t fixed in master. It happens because, about once in 100 times, the ports randomly selected for the agent and metrics collide. We need to avoid choosing a random port that is already assigned to something else.
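
The fix described here can be sketched roughly as follows: when picking a random port, skip any candidate that is already assigned, and record each pick before choosing the next one. This is not Ray's actual implementation; `pick_unused_port`, the port range, and the retry limit are all assumptions for illustration.

```python
import random


def pick_unused_port(assigned, low=10000, high=65535, rng=random, max_tries=1000):
    """Pick a random port in [low, high] that is not in `assigned`.

    Hypothetical helper; the range and retry limit are arbitrary choices.
    """
    for _ in range(max_tries):
        candidate = rng.randint(low, high)
        if candidate not in assigned:
            return candidate
    raise RuntimeError("could not find an unused port")


# Ports that were already fixed in the failing run.
assigned = {31009, 10001, 8265}

dashboard_agent = pick_unused_port(assigned)
assigned.add(dashboard_agent)  # record it before picking the next one

metrics_export = pick_unused_port(assigned)
assigned.add(metrics_export)

print(dashboard_agent, metrics_export)
```

Because the first pick is added to `assigned` before the second draw, the two components can no longer end up on the same port.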

In our application, this seems to occur with more than 1% probability. It has happened multiple times in the past year, and a simple restart does not solve it. We use Docker to deploy nodes and, strangely, the same port conflict recurs on every restart.

Random means the port is chosen randomly when a process starts. Some processes cannot do this because of implementation limitations, so they pre-choose a port (or it is hardcoded). Generally, if Ray is deployed in production, it is a good idea to set all ports manually.

Let me check this soon

Hi @rkooo567, it seems the ticket is automatically closed, and I wonder if this has been fixed. Thanks!