ray: [Autoscaler] The autoscaler could not find a node type to satisfy the request
What happened + What you expected to happen
I am running the PPO Trainer with num_workers set to 8. When I first launch an experiment after creating a new Ray cluster, everything gets scheduled and the autoscaler reports no errors. After the experiment completes and I re-submit the job, I get the following error:
The autoscaler could not find a node type to satisfy the request: [{"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}, {"CPU": 0}]
My CPU worker has the following resources: a request of 8 CPUs and a limit of 16 CPUs, with sufficient memory.
I was able to bypass this error by creating a new worker node called cpu-worker-small that has 1 CPU and 10Gi of memory; however, the actors then fail unexpectedly (probably due to resource constraints).
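For readers unfamiliar with this error, the check behind it can be pictured as a per-bundle fit test against each available node type. The following is a toy sketch under stated assumptions, not the Ray autoscaler's actual code: the `fits` and `unsatisfiable` helpers are hypothetical, and the node-type resources are assumptions based on the numbers above (8 CPUs for the CPU worker, 1 CPU for cpu-worker-small).

```python
from typing import Dict, List

Resources = Dict[str, float]


def fits(bundle: Resources, node: Resources) -> bool:
    """Return True if the node offers at least the requested amount of every resource."""
    return all(node.get(name, 0.0) >= amount for name, amount in bundle.items())


def unsatisfiable(bundles: List[Resources], node_types: Dict[str, Resources]) -> List[Resources]:
    """Return the bundles that no single node type can hold."""
    return [b for b in bundles if not any(fits(b, n) for n in node_types.values())]


# Assumed node types (hypothetical values based on the description above).
node_types = {
    "cpu-worker": {"CPU": 8},
    "cpu-worker-small": {"CPU": 1},
}

# The request from the error message: eleven zero-CPU bundles.
request = [{"CPU": 0} for _ in range(11)]

print(unsatisfiable(request, node_types))        # []
print(unsatisfiable([{"CPU": 16}], node_types))  # [{'CPU': 16}]
```

Under this naive model a zero-CPU bundle fits every node type, which is part of why the reported message is surprising.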
I saw some similar issues, such as https://github.com/ray-project/ray/issues/12441, but it seems that after the first experiment runs there are no resource demands in the autoscaler. I am using the default settings from KubeRay.
After running the first experiment, I checked the autoscaler state (shown below), which I take to mean that issue https://github.com/ray-project/ray/issues/24259 made it into Ray 1.12.1:
{'18ba94e4efd31538df7740fec64192e7': {'bundles': {0: {'CPU': 1.0},
                                                  1: {'CPU': 8.0},
                                                  2: {'CPU': 8.0},
                                                  3: {'CPU': 8.0},
                                                  4: {'CPU': 8.0},
                                                  5: {'CPU': 8.0},
                                                  6: {'CPU': 8.0},
                                                  7: {'CPU': 8.0},
                                                  8: {'CPU': 8.0}},
                                      'name': '__tune_5ba4d6b7__981c2416',
                                      'placement_group_id': '18ba94e4efd31538df7740fec64192e7',
                                      'state': 'REMOVED',
                                      'stats': {'end_to_end_creation_latency_ms': 0.0,
                                                'highest_retry_delay_ms': 1000.0,
                                                'scheduling_attempt': 171,
                                                'scheduling_latency_ms': 0.0,
                                                'scheduling_state': 'REMOVED'},
                                      'strategy': 'PACK'},
 '18bb3aa16ae8d9e9edd5da23ead26838': {'bundles': {0: {'CPU': 1.0},
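A dump like this is easier to read once it is reduced to one line per placement group. The sketch below is only an illustration: the `summarize` helper is hypothetical, the table holds just the first entry from the dump above (with the `stats` field left out for brevity), and the second entry is omitted because it is truncated in the report. On a live cluster a table of this shape would presumably come from `ray.util.placement_group_table()`.

```python
from typing import Any, Dict, List

# First placement group from the dump above; bundles 1-8 each request 8 CPUs.
table: Dict[str, Dict[str, Any]] = {
    "18ba94e4efd31538df7740fec64192e7": {
        "bundles": {0: {"CPU": 1.0}, **{i: {"CPU": 8.0} for i in range(1, 9)}},
        "name": "__tune_5ba4d6b7__981c2416",
        "placement_group_id": "18ba94e4efd31538df7740fec64192e7",
        "state": "REMOVED",
        "strategy": "PACK",
    },
}


def summarize(pg_table: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reduce each placement group to its name, state, strategy, and CPU totals."""
    rows = []
    for pg in pg_table.values():
        bundles = pg["bundles"].values()
        rows.append({
            "name": pg["name"],
            "state": pg["state"],
            "strategy": pg["strategy"],
            "num_bundles": len(pg["bundles"]),
            "total_cpus": sum(b.get("CPU", 0.0) for b in bundles),
        })
    return rows


for row in summarize(table):
    print(row)
```

For the entry shown, this reports 9 bundles and 65 CPUs in total for a group whose state is already REMOVED.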
Versions / Dependencies
Ray 2.0 with KubeRay on a Kubernetes cluster (latest version).
Reproduction script
PPO with the following configuration: { "num_gpus": 1, "num_workers": 2, "num_sgd_iter": 60, "train_batch_size": 12000 }
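To relate this configuration to the resource requests seen by the autoscaler, here is a minimal sketch that derives the bundles it might produce. It assumes that RLlib requests one bundle for the driver plus one per rollout worker, which matches the shape of the placement group dump above. The `expected_bundles` helper is hypothetical, and the defaults of 1 CPU for `num_cpus_for_driver` and `num_cpus_per_worker` are assumptions, since the report does not state them.

```python
from typing import Dict, List

# The configuration from the report, written as a Python dict.
config = {
    "num_gpus": 1,
    "num_workers": 2,
    "num_sgd_iter": 60,
    "train_batch_size": 12000,
}


def expected_bundles(cfg: dict) -> List[Dict[str, float]]:
    """Hypothetical helper: one driver bundle followed by one bundle per worker."""
    driver = {"CPU": float(cfg.get("num_cpus_for_driver", 1))}
    if cfg.get("num_gpus", 0):
        # Assumption: the driver bundle carries the GPUs requested via num_gpus.
        driver["GPU"] = float(cfg["num_gpus"])
    worker = {"CPU": float(cfg.get("num_cpus_per_worker", 1))}
    return [driver] + [dict(worker) for _ in range(cfg.get("num_workers", 0))]


print(expected_bundles(config))
# [{'CPU': 1.0, 'GPU': 1.0}, {'CPU': 1.0}, {'CPU': 1.0}]
```

Under the same model, the dump shown earlier (one 1-CPU bundle followed by eight 8-CPU bundles) would correspond to `num_workers=8` and `num_cpus_per_worker=8` rather than to the values listed here.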
Issue Severity
High: It blocks me from completing my task.
About this issue
- State: closed
- Created 2 years ago
- Comments: 30 (30 by maintainers)
Got it, so the sequence of events is
I will try this out to see what’s going on.
Hi @DmitriGekhtman, sure thing!
I have submitted jobs back-to-back (2-3). Is there any other information which may be helpful?
Yeah, it is true that PACK is a “soft” constraint… starting to poke around now.