ray: [Bug] Unable to Schedule Fractional CPUs on Windows 10

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

The Problem:

Even with enough available resources (e.g. 12 CPUs), actor number 13 is always pending, regardless of the fractional CPU value assigned to it. After that, most subsequent actors also remain pending.

(screenshots omitted)

Resources

Resource status during actor creation (12 logical CPUs / 6 physical cores, 32 GB RAM):

(screenshots omitted)

Versions / Dependencies

Ray version 1.9.0

Redis version 4.0.2

Reproduction script

  import time
  import ray
  
  @ray.remote(num_cpus=0.01)  # tried different values
  class Actor:
      def __init__(self):
          pass
  
  if __name__ == '__main__':
      try:
          ray.init(num_cpus=12)  # tried different values
          time.sleep(30)
          for i in range(50):
              Actor.options(name=str(i + 1), lifetime="detached").remote()
              time.sleep(10)
      except Exception as e:
          print("Exception {} ".format(str(e)))
      finally:
          ray.shutdown()
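
A small diagnostic sketch (my own addition, not part of the original report): it starts a cluster the same way as the script above and prints what the scheduler believes is still free after creating a handful of fractional-CPU actors, which is one way to confirm that the CPU pool is nowhere near exhausted.

  import time
  import ray

  ray.init(num_cpus=12)  # same accounting value as in the reproduction script

  @ray.remote(num_cpus=0.01)
  class Probe:
      pass

  # Keep the handles so the actors stay alive and hold their CPU reservations.
  handles = [Probe.remote() for _ in range(10)]
  time.sleep(5)  # give the actors a moment to start

  print("total:    ", ray.cluster_resources())    # e.g. {'CPU': 12.0, ...}
  print("available:", ray.available_resources())  # roughly 11.9 CPUs still free

  ray.shutdown()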

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 37 (24 by maintainers)

Most upvoted comments

That’s awesome! Thanks for your help resolving this (and reporting it!).

Yes, it will be part of the 1.10 release. In 1.10 we will likely declare single-node Ray core on Windows stable; I actually think https://github.com/ray-project/ray/pull/20986 was the last big bug in Ray core on Windows, given that many tests started passing after it was merged (fingers crossed). There are some components that need more work (like runtime environments) that are not used by default, and we will address those in Q1 of 2022.

I’m going to go ahead and close this issue. @czgdp1807, can you open a separate one for the investigation into warning users when actors are waiting to be scheduled for an extended period of time due to lack of resources?

https://s3-us-west-2.amazonaws.com/ray-wheels/master/4dcba1d0f43880bf22297f1bc9478930b85038ac/ray-2.0.0.dev0-cp39-cp39-win_amd64.whl

Cheers. Totally forgot about it. I spent my development life in Ubuntu and just switched to Windows recently!

Thanks for trying 😃 Which version of Python are you using? For Python 3.7 you have to install the cp37 wheel, for Python 3.8 the cp38 wheel, and for Python 3.9 the cp39 wheel (and I assume this is on Windows).

Yes, num_cpus_ray is independent of the actual resources of the machine and can be higher (the OS scheduler will time-slice between the actors/processes). For CPU-intensive workloads setting it too high is not recommended, but for IO-intensive workloads it can be useful. However, making it very high can lead to OOM problems, since each actor requires some amount of memory (Python interpreter + libraries).
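
To make the oversubscription point concrete, here is a minimal sketch (my own example, not from this thread; the value 24 is arbitrary and chosen only for illustration): the count passed to ray.init() is just an accounting value, so it can exceed the machine’s physical core count for IO-bound actors.

  import time
  import ray

  # Oversubscription sketch: 24 logical "CPUs" on a machine that may have far
  # fewer physical cores; the OS scheduler time-slices the actor processes.
  ray.init(num_cpus=24)

  @ray.remote(num_cpus=1)
  class Waiter:
      def wait(self, seconds):
          time.sleep(seconds)  # stands in for IO-bound work (network, disk, ...)
          return seconds

  actors = [Waiter.remote() for _ in range(24)]
  # All 24 sleeps overlap because the actors are blocked on IO, not computing;
  # each actor still costs a Python process worth of memory, hence the OOM caveat.
  print(ray.get([a.wait.remote(1) for a in actors]))

  ray.shutdown()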

@czgdp1807 It is expected that if num_cpus_ray gets exhausted by the actors’ resource requirements, no more actors are scheduled (note that this is different from the issue originally reported here: there, num_cpus_ray was not exhausted, since the actors only requested a very small fractional amount of resources). However, in that case it should print a message that there are currently unschedulable actors (unless autoscaling is enabled, which it probably is not on a single-node cluster / your laptop). If there is no such message, that is a bug.
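
For reference, a back-of-the-envelope check using the values from the reproduction script above shows how far the original report was from exhausting the CPU pool:

  num_cpus_ray  = 12    # logical CPUs handed to ray.init() in the script
  num_actors    = 50    # actors the script tries to create
  cpu_per_actor = 0.01  # num_cpus requested by each actor

  total_requested = num_actors * cpu_per_actor  # 0.5 CPU in total
  assert total_requested <= num_cpus_ray
  # Only 0.5 of the 12 CPUs would ever be claimed, so the pending actors in the
  # original report cannot be explained by resource exhaustion.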

Note that this is a different issue from the one originally reported: it is more of a usability issue, but nonetheless important since it can confuse users. We should create a new issue to track it.

@scv119 Can you have a quick look at the message above ^? This looks like a bug: if there is only a single node in the cluster, tasks shouldn’t be spilled, right? Who on the core team should help with this / knows this code best?

If actors cannot be scheduled due to lack of CPUs (and the actor requires a CPU), I would expect a message to the user stating so, unless the cluster has an autoscaler running.

A minor update: there is an anomaly I found while going through logs/gcs_server.out.

For an actor that went ALIVE:

[2021-12-14 17:30:45,683 I 2524 5244] gcs_actor_scheduler.cc:213: Start leasing worker from node d6aa91a54d5543c2d98800ce1c05cbfbcb4bea77b16c3345e24e25b4 for actor 5ff868da9c04fcdda92c70b801000000, job id = 01000000
[2021-12-14 17:30:45,689 I 2524 5244] gcs_actor_scheduler.cc:536: Finished leasing worker from d6aa91a54d5543c2d98800ce1c05cbfbcb4bea77b16c3345e24e25b4 for actor 5ff868da9c04fcdda92c70b801000000, job id = 01000000

For an actor that was always in the PENDING_CREATION state:

[2021-12-14 17:30:46,681 I 2524 5244] gcs_actor_scheduler.cc:213: Start leasing worker from node d6aa91a54d5543c2d98800ce1c05cbfbcb4bea77b16c3345e24e25b4 for actor 5d2144bb196d3d7f049c176701000000, job id = 01000000
[2021-12-14 17:30:47,679 W 2524 5244] gcs_actor_manager.cc:321: Actor with name '3' was not found.

So, the difference is that the worker for the actor in the second case was never leased, or at least the leasing never finished. Maybe that’s why the actor never went alive?

cc: @wuisawesome

Yes, this looks like the same issue as https://discuss.ray.io/t/with-enough-available-resources-most-of-the-actors-creation-is-pending/4364. Thanks for raising awareness on GitHub, @John-Almardeny.

Hmm, this looks like a Windows-related issue; @John-Almardeny, unfortunately the Windows support is experimental.

When the actors fail to schedule, do you notice any useful information in the driver output?

cc @pcmoritz