ray: Avoiding the `pending and cannot currently be scheduled` warning

Problem

It’s often easier to ‘fire and forget’ when starting actors while resources are not yet available (waiting for Ray to autoscale). However, we often see this notorious warning:

2020-01-13 18:15:09,447	WARNING worker.py:1062 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:172.22.225.108: 1.000000}, {CPU: 2.000000}, {memory: 16.113281 GiB}, {GPU: 1.000000}, {object_store_memory: 5.566406 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
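For intuition, the condition this warning reports can be modeled as a simple per-node resource check. This is an illustrative simplification only, not Ray’s actual scheduler logic, and the function name is made up:

```python
def is_feasible(demand, remaining):
    """Return True if a node's remaining resources cover the demand."""
    # A pending task/actor can only be scheduled once every resource it
    # requires is still available in sufficient quantity on the node.
    return all(remaining.get(res, 0.0) >= amt for res, amt in demand.items())

# A 1-CPU actor fits on a node with 2 CPUs remaining...
print(is_feasible({"CPU": 1.0}, {"CPU": 2.0}))  # True
# ...but not once other actors have claimed all the CPUs.
print(is_feasible({"CPU": 1.0}, {"CPU": 0.0}))  # False
```

In the log above the node still reports free CPUs, but pending actors ahead in the queue will claim them; the warning fires while the demand remains unmet.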

You might say “ok well you can solve that with placement groups”. Unfortunately, this introduces complexity in various ways:

The user ends up needing to manage the creation of the placement group themselves.

Ideally, you have this:

actors = []
# start N actors, each in its own placement group
for i in range(num_actors):
    pg = placement_group([{"CPU": 1}])
    actors += [Actor.options(placement_group=pg).remote()]

But in order to avoid the warning message:

# start N placement groups first
groups = {}  # ready ref -> placement group
for i in range(num_actors):
    pg = placement_group([{"CPU": 1}])
    groups[pg.ready()] = pg

# only create each actor once its placement group is actually ready
while groups:
    ready, _ = ray.wait(list(groups))
    for ref in ready:
        pg = groups.pop(ref)
        actors += [Actor.options(placement_group=pg).remote()]

If you want to reuse the placement group, you have to introduce a new state into your application-level scheduling.

Ideal:

# PG reuse
done, _ = ray.wait(done_refs)  # refs that signal each actor is finished
for ref in done:
    old_actor = actor_for_ref[ref]
    old_actor.stop.remote()
    pg = ray.get(old_actor.get_pg.remote())
    actors += [Actor.options(placement_group=pg).remote()]

But in order to avoid the warning message:

# PG reuse: also wait for each stop() to finish before reusing its PG
stopping = {}  # stop ref -> actor being stopped
while True:
    done, _ = ray.wait(done_refs, timeout=0.1)
    for ref in done:
        old_actor = actor_for_ref[ref]
        stopping[old_actor.stop.remote()] = old_actor
    if stopping:
        stopped, _ = ray.wait(list(stopping), timeout=0.1)
        for stop_ref in stopped:
            old_actor = stopping.pop(stop_ref)
            pg = ray.get(old_actor.get_pg.remote())
            actors += [Actor.options(placement_group=pg).remote()]

Proposed Solution

Suppress the warning entirely for placement groups (just assume users know what they’re doing in that case).
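A minimal sketch of what that suppression check could look like. It relies on the fact that placement-group-bound demands show up internally under resource names containing a `_group_` marker (e.g. `CPU_group_<pg_id>`); the function itself and its name are hypothetical, not Ray code:

```python
def should_warn_pending(required_resources: dict) -> bool:
    """Hypothetical gate for the 'pending and cannot be scheduled' warning."""
    # If any required resource is placement-group-scoped, assume the user
    # opted into placement groups deliberately and skip the warning.
    uses_placement_group = any("_group_" in name for name in required_resources)
    return not uses_placement_group

print(should_warn_pending({"CPU": 1.0}))                     # True: warn
print(should_warn_pending({"CPU_group_4482dec0faaf": 1.0}))  # False: suppress
```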

cc @ericl @edoakes @wuisawesome @krfricke

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 29 (24 by maintainers)

Most upvoted comments

@pcmoritz does Eric’s above explanation make sense? If so, could you please +1?

It’d be good to get closure on this thread before we drag it on forever.

Suppress the warning entirely for placement groups (just assume users know what they’re doing in that case).

Sounds good, and pretty easy (just ignore if there is a placement group resource).

There are numerous solutions in this thread that satisfy my application requirements:

  1. Suppress the warning entirely for placement groups (just assume users know what they’re doing in that case).
  2. Suppress the warning for placement groups behind a flag (kai’s suggestion)
  3. Only raise the warning if the task has been infeasible for more than X seconds
  4. Make the autoscaler print the warning instead of worker.py
  5. Add a RayConfig option that disables the warning for tasks and actors
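Option 3 amounts to a small debounce around the warning: record when each task/actor first became infeasible and only warn once a grace period has elapsed. A sketch, with all names hypothetical:

```python
import time


class PendingWarningDebouncer:
    """Allow the warning only after a task has been pending past a grace period."""

    def __init__(self, grace_period_s: float = 30.0):
        self.grace_period_s = grace_period_s
        self._first_pending = {}  # task id -> timestamp when first seen pending

    def should_warn(self, task_id, now=None):
        now = time.monotonic() if now is None else now
        first = self._first_pending.setdefault(task_id, now)
        return now - first > self.grace_period_s

    def mark_scheduled(self, task_id):
        # Once scheduled, forget the task so a later pending spell restarts the clock.
        self._first_pending.pop(task_id, None)
```

With a 10-second grace period, a task pending for 5 seconds stays quiet, one pending for 11 seconds triggers the warning, and scheduling it resets the clock.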

Whether or not there’s a ‘core api change’ is up to whichever solution we decide on. Personally, I like 1.

@wuisawesome that sounds good, maybe push a draft PR and we can take a look?