ray: Ray Tune doesn't scale; scheduling performance degrades to less than 25% worker utilization with 32 workers.

What is the problem?

I run a grid search of 19440 trials using FIFOScheduler and the default search_alg with 32 worker CPUs.

I encounter 2 problems:

  1. In the first few minutes, 100% of the workers are utilized. After ~1100 trials complete, only ~6 worker CPUs are busy at a time and the rest are idle, while the log reports “Number of trials: 2930/19440 (1000 PENDING, 32 RUNNING, 1898 TERMINATED)”. In other words, scheduling keeps less than 25% of the worker CPUs busy.
  2. Looking into the log files, I notice that at this point gcs_server.out is 16 GB in size and filled with ~89 million lines like the following (a small counting sketch follows the excerpt):
[2021-04-25 22:33:39,901 I 2215124 2215124] gcs_placement_group_scheduler.cc:141: Scheduling placement group __tune_19c488be__55f5f383, id: 4b7b88f0acf7d6dadc1ec0d726e39f4a, bundles size = 1
[2021-04-25 22:33:39,901 I 2215124 2215124] gcs_placement_group_scheduler.cc:150: Failed to schedule placement group __tune_19c488be__55f5f383, id: 4b7b88f0acf7d6dadc1ec0d726e39f4a, because no nodes are available.
[2021-04-25 22:33:39,901 I 2215124 2215124] gcs_placement_group_manager.cc:215: Failed to create placement group __tune_19c488be__55f5f383, id: 4b7b88f0acf7d6dadc1ec0d726e39f4a, try again.
[2021-04-25 22:33:39,902 I 2215124 2215124] gcs_placement_group_scheduler.cc:141: Scheduling placement group __tune_19c488be__45ec09b4, id: 164f45d0a5e0ad31aad8d7acfe455e0b, bundles size = 1
[2021-04-25 22:33:39,902 I 2215124 2215124] gcs_placement_group_scheduler.cc:150: Failed to schedule placement group __tune_19c488be__45ec09b4, id: 164f45d0a5e0ad31aad8d7acfe455e0b, because no nodes are available.
[2021-04-25 22:33:39,902 I 2215124 2215124] gcs_placement_group_manager.cc:215: Failed to create placement group __tune_19c488be__45ec09b4, id: 164f45d0a5e0ad31aad8d7acfe455e0b, try again.
[2021-04-25 22:33:39,902 I 2215124 2215124] gcs_placement_group_scheduler.cc:141: Scheduling placement group __tune_19c488be__2db3328e, id: 23066cf1865fe81ca4922400efcddd51, bundles size = 1
[2021-04-25 22:33:39,902 I 2215124 2215124] gcs_placement_group_scheduler.cc:150: Failed to schedule placement group __tune_19c488be__2db3328e, id: 23066cf1865fe81ca4922400efcddd51, because no nodes are available.
[2021-04-25 22:33:39,902 I 2215124 2215124] gcs_placement_group_manager.cc:215: Failed to create placement group __tune_19c488be__2db3328e, id: 23066cf1865fe81ca4922400efcddd51, try again.
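A minimal sketch for tallying how much of the file is this retry pattern (assuming Ray's default session log directory; adjust the path if _temp_dir is changed as in the repro below):

from collections import Counter

# Log path under Ray's default temp dir; the repro script below relocates the
# temp dir under DEFAULT_RESULTS_DIR/tmp, so point this wherever gcs_server.out lives.
LOG_PATH = "/tmp/ray/session_latest/logs/gcs_server.out"

counts = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if "Failed to schedule placement group" in line:
            counts["failed to schedule"] += 1
        elif "Failed to create placement group" in line:
            counts["failed to create (retry)"] += 1
        elif "Scheduling placement group" in line:
            counts["scheduling attempt"] += 1

print(counts)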

OS: Ubuntu 20.04
Python: 3.8.8
Ray: 1.3.0

Minimal reproduction code

The following code reproduces both issues:

  1. Scheduling performance degrades to less than 30% of workers being utilized (32 worker CPUs total) after ~1100 completed trials (the load_avg metric drops at that point). The following warnings appear after ~2000 completed trials:
2021-04-26 12:20:16,000	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.513 s, which may be a performance bottleneck.
2021-04-26 12:20:16,741	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.626 s, which may be a performance bottleneck.
2021-04-26 12:20:17,384	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.521 s, which may be a performance bottleneck.
2021-04-26 12:20:18,132	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.626 s, which may be a performance bottleneck.
2021-04-26 12:20:18,849	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.611 s, which may be a performance bottleneck.
2021-04-26 12:20:19,487	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.532 s, which may be a performance bottleneck.
2021-04-26 12:20:20,324	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.719 s, which may be a performance bottleneck.
2021-04-26 12:20:20,979	WARNING util.py:161 -- The `choose_trial_to_run` operation took 0.541 s, which may be a performance bottleneck.
  2. gcs_server.out is spammed with log messages; running the code produces a 22 GB gcs_server.out.

The code itself:

import os
import zlib
import struct

import ray
from ray.tune.result import DEFAULT_RESULTS_DIR
from ray.tune.schedulers import FIFOScheduler


def cpu_hog(value, bits):
    # CPU-bound busy work: repeatedly CRC32-hash `value` until its lowest
    # `bits` bits are all zero; larger `bits` means more iterations on average.
    mask = (1 << bits) - 1
    steps = 1
    while (value := zlib.crc32(struct.pack("<I", value))) & mask:
        steps += 1
    return steps, value


def tune_one(config):
    # Trainable: run the busy loop once and report the 1-minute load average
    # alongside the result so scheduler throughput can be tracked over time.
    steps, score = cpu_hog(config["value"], config["complexity"])
    ray.tune.report(steps=steps, score=score, load_avg=os.getloadavg()[0])


def tune():
    ray.init(_temp_dir=os.path.join(DEFAULT_RESULTS_DIR, "tmp"))  # Otherwise, gcs_server.out fills up /tmp.
    ray.tune.run(
        tune_one,
        config={"value": ray.tune.grid_search(list(range(10000))), "complexity": 26},
        scheduler=FIFOScheduler()
    )


if __name__ == "__main__":
    tune()
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (22 by maintainers)

Most upvoted comments

Thanks for confirming. I’ll submit a quick PR that sets TUNE_MAX_PENDING_TRIALS_PG by default to max(16, the number of available cluster CPUs divided by the number of CPUs required per trial). With a minimum of 16 pending trials by default, autoscaling should still be triggered reasonably fast. Note that this doesn’t limit the number of parallel (i.e., running) trials, just the number of placement groups and trial objects that are created.
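Roughly, the intended default would look like this (a sketch of the formula above, not the actual PR code):

import math

def default_max_pending_trials(cluster_cpus, cpus_per_trial=1):
    # max(16, available cluster CPUs / CPUs required per trial), as described above.
    return max(16, math.ceil(cluster_cpus / max(cpus_per_trial, 1)))

# For this issue's setup (32 worker CPUs, 1 CPU per trial) this would allow 32
# pending trials at a time instead of the 1000 PENDING seen in the status line above.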

I’ll get to the TrialRunner._trials optimization in a different PR.

Can you try setting TUNE_MAX_PENDING_TRIALS_PG=x to see whether it solves the gcs_server.out problem?
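For reference, setting it from Python would look roughly like this (the value is just an example, not a recommendation):

import os

# Must be set before ray.tune.run() starts; exporting it in the shell
# (TUNE_MAX_PENDING_TRIALS_PG=64 python <your_script>.py) should work as well.
# 64 is an arbitrary example, on the order of a couple of times the worker CPU count.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "64"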

I’m happy to work on removing finished trials from TrialRunner._trials this week/early next week. We will probably include these fixes in a 1.3.1 patch release.
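For context, a toy illustration (not Ray’s actual code) of why that list matters: each scheduling pass scans all trial objects, so once thousands of trials are TERMINATED every choose_trial_to_run-style pass walks past them to find the few that are still startable.

# Toy illustration (not Ray's actual code): a scheduler that scans one flat
# trial list slows down linearly as TERMINATED trials accumulate in it.
trials = [{"status": "TERMINATED"}] * 2000 + [{"status": "PENDING"}] * 1000

def choose_next(trials):
    # O(len(trials)) per call, even though only PENDING trials are candidates.
    for trial in trials:
        if trial["status"] == "PENDING":
            return trial
    return None

next_trial = choose_next(trials)  # walks past all 2000 finished trials first
print(next_trial)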