ray: "slow start" launching worker processes on new nodes
What is the problem?
This is with Ray 1.0.1 on Ubuntu 20.04 on AWS c5a servers.
I create a cluster with 13 worker machines, each an AWS c5a.16xlarge node, so 64 vCPUs per worker machine (13 × 64 = 832 total). Watching the Ray dashboard, I can see the number of worker processes, at the bottom, versus the number of cores. It takes somewhere between 3.5 and 4 minutes for the number of worker processes to equal the number of cores.
Under Ray 1.0.0, the worker processes were launched at the start, and I’d immediately have full CPU utilization across my cluster. Now it takes nearly four minutes. Once some other issues get resolved, I’d like to increase the number of vCPUs by a factor of ten or more, at which point this “slow start” behavior would be the gating factor in my ability to achieve scalable performance.
(Right now, some other unrelated bugs are limiting my ability to add more workers.)
Reproduction (REQUIRED)
I ran a demo for @rkooo567 so he could see the behavior.
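For reference, a minimal sketch of the kind of measurement involved, assuming a cluster that is already up and reachable via `address="auto"`; the `spin` task and the 10-second duration are illustrative choices, not the original demo:

```python
import time
import ray

ray.init(address="auto")  # connect to the running cluster

total_cpus = int(ray.cluster_resources().get("CPU", 0))
print(f"cluster advertises {total_cpus} CPUs")

@ray.remote(num_cpus=1)
def spin(seconds):
    # burn CPU so utilization is visible on the dashboard
    end = time.time() + seconds
    while time.time() < end:
        pass

start = time.time()
# one wave of CPU-bound tasks, one per advertised core
ray.get([spin.remote(10) for _ in range(total_cpus)])
print(f"wave of {total_cpus} tasks finished in {time.time() - start:.1f}s "
      f"(ideal: ~10s if every core had a worker process ready)")
```

If worker processes were all pre-started, the wave should finish in roughly the task duration; the gap beyond that reflects the worker ramp-up described above.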
Desired fix? Some way of telling Ray to start worker processes immediately when nodes are launched.
About this issue
- State: closed
- Created 4 years ago
- Comments: 22 (10 by maintainers)
@rkooo567 can we come up with a simpler repro? The problem with a complex repro is that the issue could always be in the application code. Have you tried reproducing this on that cluster with a simple wave of tasks?
Another thing to try is reproducing the issue on a different cluster. If it can't be reproduced there, there might be some environment-specific problem (slow NFS mount, etc.). A sketch of what such a simple repro could look like follows.
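As one possible shape for that simpler repro (hedged: `whoami`, the 0.5-second sleep, and the 300-second timeout are illustrative choices, not from this thread), repeated waves of short tasks can track how many distinct worker processes have actually started over time:

```python
import os
import socket
import time
import ray

ray.init(address="auto")  # attach to the existing cluster

total_cpus = int(ray.cluster_resources().get("CPU", 0))

@ray.remote(num_cpus=1)
def whoami():
    # sleep briefly so the wave genuinely needs many concurrent workers
    time.sleep(0.5)
    return (socket.gethostname(), os.getpid())

seen = set()  # distinct (host, pid) worker processes observed so far
t0 = time.time()
while len(seen) < total_cpus and time.time() - t0 < 300:
    seen.update(ray.get([whoami.remote() for _ in range(total_cpus)]))
    print(f"{time.time() - t0:6.1f}s: {len(seen)}/{total_cpus} workers seen")
```

A slow climb in the printed count, independent of any application code, would isolate the worker-startup behavior itself.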