ray: [core] "soft" option for NodeAffinitySchedulingStrategy doesn't seem to work

What happened + What you expected to happen

The problem

The following script should be able to utilize all CPU slots on a cluster, but when you run it it will only use CPUs on the head node.

This is a blocker for Dataset streaming, which uses this scheduling hint for best-effort locality optimization. Not supporting “soft” properly means locality scheduling can severely degrade performance.

Versions / Dependencies

master / py39

Reproduction script

import time
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

cur_node = ray.get_runtime_context().get_node_id()

@ray.remote
def f():
    time.sleep(1)
    print("done")

start = time.time()
ray.get([
    f.options(scheduling_strategy=NodeAffinitySchedulingStrategy(cur_node, soft=True)).remote()
    for _ in range(1000)
])
print(time.time() - start)

Reproduction workspace: https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_1ncv7mqbfapj4ax5sa1679seep/ses_msttcvnbvzbgzefa2hia2kui9h?command-history-section=head_start_up_log

Issue Severity

High: It blocks me from completing my task.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 24 (19 by maintainers)

Commits related to this issue

Most upvoted comments

I think the ideal case would be for soft to just have one meaning, “spill on busy or failure”, but we should only do this if the implementation for “on busy” is good enough. What @jjyao said about the shuffle use case is right, but I don’t believe we necessarily need “spill on failure” semantics; I think it’s just that the performance is more predictable with this policy. I’m not sure if there are other use cases that need “spill on failure” but not “spill on busy”.

I’d say we should try out “spill on busy or failure” on the push-based shuffle release test and also add some CI tests for the behavior. If these results are good, we can change the semantics. Otherwise, we should probably just add a second flag.

If we want well defined semantics, I think we should have explicit flags about the behaviors, such as spill_on_failure=True, spill_on_busy=True. Otherwise, I’d favor soft providing best-effort scheduling, defined as “try to put it here, but if not possible, fall back to other nodes. Your code should be able to tolerate the task getting placed on other nodes”.