ray: [Ray component: Core] Add way to fix problems with ray::IDLE workers taking up too many resources
Description
Ray has --num-cpus to limit the number of CPUs used on a node, but this does not limit the number of ray::IDLE worker processes that exist. For example, after running ray start --address=10.159.8.149:53454 --num-cpus 4, I have seen over 90 ray::IDLE workers created. Each of these workers uses CPU and memory, which adds up to significant resource use.
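To put a number on the resource use described above, here is a minimal sketch that counts ray::IDLE processes and sums their resident memory from `ps -eo pid,rss,args` output. The function names and the sample listing are hypothetical and only for illustration; it assumes a Linux-style ps where RSS is reported in KiB and the worker's process title starts with ray::IDLE.

```python
import subprocess
from typing import Tuple

# Hypothetical sample of `ps -eo pid,rss,args` output (not real measurements).
SAMPLE = """\
  PID   RSS COMMAND
 1001 81234 ray::IDLE
 1002 80990 ray::IDLE
 1003 45012 /usr/bin/python3 train.py
"""


def summarize_idle_workers(ps_output: str) -> Tuple[int, int]:
    """Return (count, total_rss_kib) for processes titled ray::IDLE."""
    count = 0
    total_rss_kib = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, rss, command line
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # skip the header row and malformed lines
        if parts[2].startswith("ray::IDLE"):
            count += 1
            total_rss_kib += int(parts[1])
    return count, total_rss_kib


def live_summary() -> Tuple[int, int]:
    """Run ps on the current node and summarize its ray::IDLE processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,args"],
        capture_output=True, text=True, check=True,
    )
    return summarize_idle_workers(result.stdout)


count, rss_kib = summarize_idle_workers(SAMPLE)
print(f"{count} idle workers using {rss_kib / 1024:.1f} MiB resident memory")
```

Calling live_summary() on the affected node would give the real count and memory total to compare against the --num-cpus value.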
On Stack Overflow, ray.init(local_mode=True) was suggested (https://stackoverflow.com/a/63231293/18954005), but that basically removes parallelism.
An alternative workaround is to use --min-worker-port and --max-worker-port to restrict the number of workers, but if ports in that range are already used by some other process, fewer workers can be created than desired.
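The port-range workaround can be sketched as follows. This is only an illustration under two assumptions: each worker takes one port from the range, and the range is inclusive at both ends. The helper names ray_start_args and free_ports and the port 20000 are hypothetical; only the ray start flags come from the description above. The second helper checks which ports in the range can currently be bound, which shows how many workers the range could realistically allow.

```python
import socket
from typing import List


def ray_start_args(address: str, num_cpus: int,
                   min_port: int, max_workers: int) -> List[str]:
    """Build a `ray start` command whose worker port range holds max_workers ports."""
    max_port = min_port + max_workers - 1  # assumes an inclusive range
    return [
        "ray", "start",
        f"--address={address}",
        "--num-cpus", str(num_cpus),
        "--min-worker-port", str(min_port),
        "--max-worker-port", str(max_port),
    ]


def free_ports(min_port: int, max_port: int,
               host: str = "127.0.0.1") -> List[int]:
    """Return the ports in [min_port, max_port] that can be bound right now."""
    free = []
    for port in range(min_port, max_port + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # already in use by some other process
            free.append(port)
    return free


args = ray_start_args("10.159.8.149:53454", num_cpus=4,
                      min_port=20000, max_workers=4)
print(" ".join(args))
print("bindable ports:", free_ports(20000, 20003))
```

If free_ports returns fewer ports than the intended worker count, the range overlaps with another process and a different min_port is needed. The check is only a snapshot, since a port can be taken between the check and the moment Ray starts.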
Use case
I would like to be able to limit the resources used by ray::IDLE workers.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 30 (9 by maintainers)
For what it is worth, the only way I have found to decrease the amount of resources that ray::IDLE workers use was to switch to dask 😦
I have the same question. The Ray state is idle, yet CPU usage is at 100% and memory increases gradually. I am curious what Ray is doing at that point. By then all the Ray subprocesses had finished, according to the log I printed.
Then the OOM happened.