ray: [Ray component: Core] Add way to fix problems with ray::IDLE workers taking up too many resources

Description

Ray has --num-cpus to limit the number of CPUs used on a node, but this does not limit the number of ray::IDLE workers that exist. For example, after running ray start --address=10.159.8.149:53454 --num-cpus 4, I have seen over 90 ray::IDLE workers created. Each of these workers uses CPU and memory, which adds up to significant resource use.
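To make the worker count above reproducible, here is a minimal sketch that counts processes whose title starts with ray::IDLE. It assumes a Linux or macOS node where `ps -eo args` is available and where idle workers appear under that process title, as described in this issue. The helper names `count_idle_workers` and `list_process_commands` are hypothetical and are not part of Ray.

```python
import subprocess


def count_idle_workers(process_lines):
    """Count command lines that start with the ray::IDLE process title."""
    return sum(
        1 for line in process_lines if line.strip().startswith("ray::IDLE")
    )


def list_process_commands():
    """Return one command line per running process (assumes a POSIX `ps`)."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    # The first line of the output is a column header, so skip it.
    return result.stdout.splitlines()[1:]


if __name__ == "__main__":
    print(count_idle_workers(list_process_commands()))
```

Running this before and after ray start gives a rough idea of how many idle workers the node is holding, independent of the --num-cpus setting.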

On Stack Overflow, ray.init(local_mode=True) was suggested (https://stackoverflow.com/a/63231293/18954005), but that basically removes parallelism.

An alternative workaround is to use --min-worker-port and --max-worker-port to restrict the number of workers, but if ports in that range are already used by some other process, fewer workers can be created than desired.
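The port-range workaround can be sketched as follows. This assumes, as the description above implies, that each worker needs one port and that the range is inclusive at both ends. The helper `build_ray_start_command` and the starting port 20000 are hypothetical choices for illustration. Only the flags already mentioned in this issue are used.

```python
import shlex


def build_ray_start_command(address, num_cpus, min_worker_port, max_workers):
    """Build a `ray start` command whose worker port range holds max_workers ports.

    Assumption: one port per worker, with an inclusive port range.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    max_worker_port = min_worker_port + max_workers - 1
    if max_worker_port > 65535:
        raise ValueError("port range exceeds 65535")
    return [
        "ray",
        "start",
        f"--address={address}",
        f"--num-cpus={num_cpus}",
        f"--min-worker-port={min_worker_port}",
        f"--max-worker-port={max_worker_port}",
    ]


if __name__ == "__main__":
    cmd = build_ray_start_command("10.159.8.149:53454", 4, 20000, 8)
    print(shlex.join(cmd))
```

As noted above, this is only an indirect limit: any port in the range that another process already holds reduces the number of workers that can start.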

Use case

I would like to be able to limit the resources used by ray::IDLE workers.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 30 (9 by maintainers)

Most upvoted comments

For what it is worth, the only way I have found to decrease the amount of resources that ray::IDLE workers use is to switch to Dask 😦

I have the same question. The Ray state is idle, yet the CPU is at 100% and memory usage increases gradually. I am curious about what is happening inside Ray. At that moment, all the Ray subprocesses had finished, as the log I printed shows. [image]

Then an OOM (out-of-memory) error occurred. [image]