ray: [Bug] [Core] Unable to schedule fractional gpu jobs

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Please find a minimal reproducible example below. I’m trying to run the following script on a Ray cluster with two nodes, each with 8 GPUs:

 import ray

 ray.init(address="auto")

 # Each actor requests 0.6 of a GPU, so only one such actor fits on any
 # single GPU; 10 actors therefore need 10 of the cluster's 16 GPUs.
 required_gpus = 0.6
 n_actors = 10

 @ray.remote(num_gpus=required_gpus)
 class A:
     def __init__(self, idx):
         self.idx = idx

     def f(self):
         return self.idx

 print(ray.cluster_resources())
 print("-" * 10)

 actors = [A.remote(i) for i in range(n_actors)]
 ray.get([a.f.remote() for a in actors])

The program will hang forever with the following message:

$ python test_ray.py
2021-12-07 06:40:15,355 INFO worker.py:843 -- Connecting to existing Ray cluster at address: 172.31.50.151:6379
{'object_store_memory': 308670280089.0, 'CPU': 128.0, 'accelerator_type:V100': 2.0, 'GPU': 16.0, 'memory': 710230653543.0, 'node:172.31.50.151': 1.0, 'node:172.31.53.4': 1.0}
----------
2021-12-07 06:40:33,496 WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffffd01a37602c435349b99b9d1d09000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {GPU: 0.600000}, {CPU: 1.000000}
Available resources on this node: {56.000000/64.000000 CPU, 17097537210.009766 GiB/17097537210.009766 GiB memory, 3.200000/8.000000 GPU, 7536779339.990234 GiB/7536779339.990234 GiB object_store_memory, 1.000000/1.000000 accelerator_type:V100, 1.000000/1.000000 node:172.31.50.151}
 In total there are 0 pending tasks and 2 pending actors on this node.

Clearly it should be possible to schedule 10 required_gpus=0.6 actors on a 16-GPU cluster: two 0.6 requests cannot share a GPU, so each actor needs its own GPU, and 10 ≤ 16.

The program passes when I set required_gpus=0.9 and n_actors=10, or when I set required_gpus=0.25 and n_actors=40.

I think the bug is caused by the following: after the scheduler has placed 8 required_gpus=0.6 actors on a node (one per GPU), it thinks the node still has 8 - 8 * 0.6 = 3.2 GPUs available, so it keeps trying to schedule the remaining actors onto that same node. In reality every GPU on that node has only 0.4 free, so a 0.6 request can never fit there; the check seems to compare against the aggregate GPU count rather than the per-GPU remainders.
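To make the arithmetic concrete, here is a small standalone simulation of the two accounting schemes (illustrative only, not Ray’s actual scheduler code). It packs 0.6-GPU requests onto an 8-GPU node: the aggregate check still accepts a ninth actor even though no single GPU has 0.6 free.

 # Simulation of placing fractional-GPU actors on one 8-GPU node.
 # Illustrative sketch only; this is not how Ray's raylet is implemented.

 def aggregate_fits(free_total, request):
     # What the buggy accounting appears to do: compare against the node-wide sum.
     return free_total >= request

 def per_gpu_fits(per_gpu_free, request):
     # What a correct placement check must do: some single GPU has enough left.
     return any(free >= request for free in per_gpu_free)

 request = 0.6
 per_gpu_free = [1.0] * 8          # one entry per physical GPU
 free_total = sum(per_gpu_free)

 for i in range(10):
     agg = aggregate_fits(free_total, request)
     real = per_gpu_fits(per_gpu_free, request)
     print(f"actor {i}: aggregate says fit={agg}, per-GPU says fit={real}")
     if not real:
         break
     # Place the actor on the first GPU with enough room.
     gpu = next(j for j, free in enumerate(per_gpu_free) if free >= request)
     per_gpu_free[gpu] -= request
     free_total -= request

 # After 8 actors: free_total is ~3.2, but every GPU has only 0.4 left, so the
 # aggregate check keeps answering "fits" even though no valid placement exists.

If this hypothesis is right, it also explains the passing configurations: with required_gpus=0.9 the aggregate remainder after 8 actors is only 0.8 < 0.9, so even the aggregate check fails and the scheduler falls through to the second node, and with required_gpus=0.25 the requests tile each GPU exactly, so the two accountings agree.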

Versions / Dependencies

I tried both v1.9.0 and the nightly build; both fail.

Reproduction script

See the example above.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

I think there are 3 possible solutions.

  1. Only allow values that match the concept of a “resource instance”, e.g. 0.5, 0.25, 0.125. This would be the easiest way to avoid the confusion, but users would no longer be able to specify values such as 0.3 (a hypothetical validation sketch follows this list).
  2. Improve the error messages. This is not trivial, since the messages are currently generated by the autoscaler, which doesn’t know the resource-instance details. To implement this, we would need to raise the error directly from the raylet.
  3. Maybe we can change the abstraction for specifying GPUs to something other than num_gpus. num_gpus has the same semantics as num_cpus but completely different behavior, which makes it a pretty bad abstraction. This approach needs a detailed proposal.
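To illustrate option 1: a hypothetical up-front check (the function name, the 1/2**k rule, and the cutoff are assumptions for illustration, not Ray’s actual API or proposed behavior) could reject fractions that don’t tile a single GPU evenly, so requests like 0.6 would fail fast instead of hanging at schedule time:

 # Hypothetical validation sketch for option 1 (not Ray's actual API or rule).
 # Accept whole GPUs or fractions of the form 1 / 2**k (0.5, 0.25, 0.125, ...),
 # so every accepted request tiles a single GPU exactly, leaving no unusable sliver.

 def is_allowed_gpu_request(num_gpus: float, max_splits: int = 8) -> bool:
     if num_gpus <= 0:
         return False
     if float(num_gpus).is_integer():
         return True
     return any(abs(num_gpus - 1 / 2 ** k) < 1e-9 for k in range(1, max_splits + 1))

 for value in (2, 1, 0.5, 0.25, 0.125, 0.6, 0.3, 0.9):
     print(value, is_allowed_gpu_request(value))
 # 0.6, 0.3 and 0.9 would be rejected up front rather than hanging at schedule time.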