ray: [Autoscaler] The autoscaler could not find a node type to satisfy the request

What happened + What you expected to happen

I am running the PPO Trainer with num_workers set to 8. When I first launch an experiment after creating a new Ray cluster, everything gets scheduled and there are no autoscaler errors. After the experiment completes and I re-submit the job, I get the following error.

The autoscaler could not find a node type to satisfy the request: [{"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}]

My CPU worker requests 8 CPUs with a limit of 16 CPUs, and has sufficient memory.

I was able to bypass this error by creating a new worker node type called cpu-worker-small with 1 CPU and 10Gi of memory; however, the actors then fail unexpectedly (probably due to resource constraints).

I saw a similar issue, https://github.com/ray-project/ray/issues/12441, but in my case there are no resource demands in the autoscaler after the first experiment runs. I am using the default settings from KubeRay.

After running the first experiment I checked the autoscaler state shown below, which suggests that issue https://github.com/ray-project/ray/issues/24259 made it into Ray 1.12.1:

{'18ba94e4efd31538df7740fec64192e7': {'bundles': {0: {'CPU': 1.0},
                                                  1: {'CPU': 8.0},
                                                  2: {'CPU': 8.0},
                                                  3: {'CPU': 8.0},
                                                  4: {'CPU': 8.0},
                                                  5: {'CPU': 8.0},
                                                  6: {'CPU': 8.0},
                                                  7: {'CPU': 8.0},
                                                  8: {'CPU': 8.0}},
                                      'name': '__tune_5ba4d6b7__981c2416',
                                      'placement_group_id': '18ba94e4efd31538df7740fec64192e7',
                                      'state': 'REMOVED',
                                      'stats': {'end_to_end_creation_latency_ms': 0.0,
                                                'highest_retry_delay_ms': 1000.0,
                                                'scheduling_attempt': 171,
                                                'scheduling_latency_ms': 0.0,
                                                'scheduling_state': 'REMOVED'},
                                      'strategy': 'PACK'},
 '18bb3aa16ae8d9e9edd5da23ead26838': {'bundles': {0: {'CPU': 1.0},
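
For reference, a dump like the one above can be pulled from a running cluster with ray.util.placement_group_table(); here is a minimal sketch (connecting with address="auto" is an assumption about how the cluster is reached):

import pprint

import ray
from ray.util import placement_group_table

# Connect to the existing cluster rather than starting a new one.
ray.init(address="auto")

# Prints every placement group known to the cluster (including REMOVED
# ones), with its bundles, strategy, and scheduling stats, in the same
# shape as the dump above.
pprint.pprint(placement_group_table())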

Versions / Dependencies

Ray 2.0 with KubeRay on a Kubernetes cluster (latest version).

Reproduction script

PPO with the following configuration: {"num_gpus": 1, "num_workers": 2, "num_sgd_iter": 60, "train_batch_size": 12000}
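
As a minimal sketch, the configuration above would be launched with Tune roughly like this (the environment name is an assumption, since the issue does not specify one here):

from ray import tune

tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",  # assumed environment, for illustration only
        "num_gpus": 1,
        "num_workers": 2,
        "num_sgd_iter": 60,
        "train_batch_size": 12000,
    },
)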

Issue Severity

High: It blocks me from completing my task.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 30 (30 by maintainers)

Most upvoted comments

Got it, so the sequence of events is

  1. Create cluster
  2. Submit workload
  3. Let workload run to completion, wait for scale-down
  4. Submit workload again
  5. Autoscaler produces a quizzical error about a bundle of CPU:0 requests and refuses to scale up

I will try this out to see what’s going on.
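
For what it's worth, steps 2-4 can be scripted against the Ray Jobs API; this is a minimal sketch, assuming the dashboard is reachable at http://localhost:8265 and a hypothetical train.py entrypoint:

import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://localhost:8265")  # assumed address

def run_once():
    # train.py is a placeholder for the actual workload script.
    job_id = client.submit_job(entrypoint="python train.py")
    # Poll until the job reaches a terminal state.
    while client.get_job_status(job_id) not in {
        JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED
    }:
        time.sleep(10)
    return client.get_job_status(job_id)

print(run_once())   # first submission: scales up and completes normally
time.sleep(300)     # wait for the cluster to scale back down
print(run_once())   # second submission: where the autoscaler error is reported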

Hi @DmitriGekhtman, sure thing!

import numpy as np
from ray import tune

# Grid search over five learning rates (0.0 to 0.8 in steps of 0.2);
# each PPO trial runs with eight rollout workers.
analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 200000},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 8,
        "lr": tune.grid_search(list(np.arange(0.0, 0.999, 0.2))),
    },
)

I have submitted jobs back-to-back (2-3 times). Is there any other information that may be helpful?

Yeah, it is true that PACK is a “soft” constraint… starting to poke around now.
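
For context, placement groups with the PACK strategy only prefer to co-locate their bundles; if they do not all fit on one node, they can still be spread across nodes. Below is a minimal sketch reproducing the bundle shape from the dump earlier in the issue (the address is an assumption):

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")  # assumes an already-running cluster

# One 1-CPU bundle for the trial driver plus eight 8-CPU bundles for the
# workers, matching the placement group dump earlier in the issue.
bundles = [{"CPU": 1}] + [{"CPU": 8}] * 8

# PACK tries to put all bundles on as few nodes as possible, but it is a
# soft constraint (unlike STRICT_PACK): scheduling still succeeds if the
# bundles have to be spread across multiple nodes.
pg = placement_group(bundles, strategy="PACK")
ray.get(pg.ready())  # blocks until every bundle has been placed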